Data Manipulation

This happens to be my 50th blog post – and my blog is 8 months old.

🙂

This post is the third and last post in in a series of posts (Part 1 – Part 2) on data manipulation with dlpyr. Note that the objects in the code may have been defined in earlier posts and the code in this post is in continuation with code from the earlier posts.

Although datasets can be manipulated in sophisticated ways by linking the 5 verbs of dplyr in conjunction, linking verbs together can be a bit verbose.

Creating multiple objects, especially when working on a large dataset can slow you down in your analysis. Chaining functions directly together into one line of code is difficult to read. This is sometimes called the Dagwood sandwich problem: you have too much filling (too many long arguments) between your slices of bread (parentheses). Functions and arguments get further and further apart.

The %>% operator allows you to extract the first argument of a function from the arguments list and put it in front of it, thus solving the Dagwood sandwich problem.

	# %>% OPERATOR ----------------------------------------------------------------------

	# with %>% operator
	hflights %>%
	mutate(diff = TaxiOut - TaxiIn) %>%
	filter(!is.na(diff)) %>%
	summarise(avg = mean(diff))

	# without %>% operator
	# arguments get further and further apart
	summarize(filter(mutate(hflights, diff = TaxiOut - TaxiIn),!is.na(diff)),
	avg = mean(diff))


	# with %>% operator
	d <- hflights %>%
	select(Dest, UniqueCarrier, Distance, ActualElapsedTime) %>%
	mutate(RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60)

	# without %>% operator
	d <- mutate(select(hflights, Dest, UniqueCarrier, Distance, ActualElapsedTime),
	RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60)

	# Filter and summarise d
	d %>%
	filter(!is.na(mph), mph < 70) %>%
	summarise(n_less = n(), n_dest = n_distinct(Dest),
	min_dist = min(Distance), max_dist = max(Distance))

	# Let's define preferable flights as flights that are 150% faster than driving,
	# i.e. that travel 105 mph or greater in real time. Also, assume that cancelled or
	# diverted flights are less preferable than driving.


	# ADVANCED PIPING EXERCISES
	# Use one single piped call to print a summary with the following variables:

	# n_non - the number of non-preferable flights in hflights,
	# p_non - the percentage of non-preferable flights in hflights,
	# n_dest - the number of destinations that non-preferable flights traveled to,
	# min_dist - the minimum distance that non-preferable flights traveled,
	# max_dist - the maximum distance that non-preferable flights traveled

	hflights %>%
	mutate(RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60) %>%
	filter(mph < 105 \| Cancelled == 1 \| Diverted == 1) %>%
	summarise(n_non = n(), p_non = 100*n_non/nrow(hflights), n_dest = n_distinct(Dest),
	min_dist = min(Distance), max_dist = max(Distance))

	# Use summarise() to create a summary of hflights with a single variable, n,
	# that counts the number of overnight flights. These flights have an arrival
	# time that is earlier than their departure time. Only include flights that have
	# no NA values for both DepTime and ArrTime in your count.

	hflights %>%
	mutate(overnight = (ArrTime < DepTime)) %>%
	filter(overnight == TRUE) %>%
	summarise(n = n())

view raw pipingOperator.r hosted with ❤ by GitHub

group_by()

group_by() defines groups within a data set. Its influence becomes clear when calling summarise() on a grouped dataset. Summarizing statistics are calculated for the different groups separately.

	# group_by() -------------------------------------------------------------------------

	# Generate a per-carrier summary of hflights with the following variables: n_flights,
	# the number of flights flown by the carrier; n_canc, the number of cancelled flights;
	# p_canc, the percentage of cancelled flights; avg_delay, the average arrival delay of
	# flights whose delay does not equal NA. Next, order the carriers in the summary from
	# low to high by their average arrival delay. Use percentage of flights cancelled to
	# break any ties. Which airline scores best based on these statistics?

	hflights %>%
	group_by(UniqueCarrier) %>%
	summarise(n_flights = n(), n_canc = sum(Cancelled), p_canc = 100*n_canc/n_flights,
	avg_delay = mean(ArrDelay, na.rm = TRUE)) %>% arrange(avg_delay)

	# Generate a per-day-of-week summary of hflights with the variable avg_taxi,
	# the average total taxiing time. Pipe this summary into an arrange() call such
	# that the day with the highest avg_taxi comes first.

	hflights %>%
	group_by(DayOfWeek) %>%
	summarize(avg_taxi = mean(TaxiIn + TaxiOut, na.rm = TRUE)) %>%
	arrange(desc(avg_taxi))

view raw group_by.R hosted with ❤ by GitHub

Combine group_by with mutate

group_by() can also be combined with mutate(). When you mutate grouped data, mutate() will calculate the new variables independently for each group. This is particularly useful when mutate() uses the rank() function, that calculates within group rankings. rank() takes a group of values and calculates the rank of each value within the group, e.g.

rank(c(21, 22, 24, 23))

has output

[1] 1 2 4 3

As with arrange(), rank() ranks values from the largest to the smallest and this behaviour can be reversed with the desc() function.

	# Combine group_by with mutate-----

	# First, discard flights whose arrival delay equals NA. Next, create a by-carrier
	# summary with a single variable: p_delay, the proportion of flights which are
	# delayed at arrival. Next, create a new variable rank in the summary which is a
	# rank according to p_delay. Finally, arrange the observations by this new rank
	hflights %>%
	filter(!is.na(ArrDelay)) %>%
	group_by(UniqueCarrier) %>%
	summarise(p_delay = sum(ArrDelay >0)/n()) %>%
	mutate(rank = rank(p_delay)) %>%
	arrange(rank)

	# n a similar fashion, keep flights that are delayed (ArrDelay > 0 and not NA).
	# Next, create a by-carrier summary with a single variable: avg, the average delay
	# of the delayed flights. Again add a new variable rank to the summary according to
	# avg. Finally, arrange by this rank variable.
	hflights %>%
	filter(!is.na(ArrDelay), ArrDelay > 0) %>%
	group_by(UniqueCarrier) %>%
	summarise(avg = mean(ArrDelay)) %>%
	mutate(rank = rank(avg)) %>%
	arrange(rank)

	# Advanced group_by exercises-------------------------------------------------------

	# Which plane (by tail number) flew out of Houston the most times? How many times?
	# Name the column with this frequency n. Assign the result to adv1. To answer this
	# question precisely, you will have to filter() as a final step to end up with only
	# a single observation in adv1.
	# Which plane (by tail number) flew out of Houston the most times? How many times? adv1
	adv1 <- hflights %>%
	group_by(TailNum) %>%
	summarise(n = n()) %>%
	filter(n == max(n))

	# How many airplanes only flew to one destination from Houston? adv2
	# How many airplanes only flew to one destination from Houston?
	# Save the resulting dataset in adv2, that contains only a single column,
	# named nplanes and a single row.
	adv2 <- hflights %>%
	group_by(TailNum) %>%
	summarise(n_dest = n_distinct(Dest)) %>%
	filter(n_dest == 1) %>%
	summarise(nplanes = n())

	# Find the most visited destination for each carrier and save your solution to adv3.
	# Your solution should contain four columns:
	# UniqueCarrier and Dest,
	# n, how often a carrier visited a particular destination,
	# rank, how each destination ranks per carrier. rank should be 1 for every row,
	# as you want to find the most visited destination for each carrier.

	adv3 <- hflights %>%
	group_by(UniqueCarrier, Dest) %>%
	summarise(n = n()) %>%
	mutate(rank = rank(desc(n))) %>%
	filter(rank == 1)

	# Find the carrier that travels to each destination the most: adv4
	# For each destination, find the carrier that travels to that destination the most.
	# Store the result in adv4. Again, your solution should contain 4 columns:
	# Dest, UniqueCarrier, n and rank.

	adv4 <- hflights %>%
	group_by(Dest, UniqueCarrier) %>%
	summarise(n = n()) %>%
	mutate(rank = rank(desc(n))) %>%
	filter(rank == 1)

view raw group_by_miscellaneous.R hosted with ❤ by GitHub

dplyr is one of the packages in R that makes R so loved by data scientists. It has three main goals:

Identify the most important data manipulation tools needed for data analysis and make them easy to use in R.
Provide blazing fast performance for in-memory data by writing key pieces of code in C++.
Use the same code interface to work with data no matter where it’s stored, whether in a data frame, a data table or database.

Introduction to the dplyr package and the tbl class
This post is mostly about code. If you’re interested in learning dplyr I recommend you type in the commands line by line on the R console to see first hand what’s happening.

	# INTRODUCTION TO dplyr AND tbls
	# Load the dplyr package
	library(dplyr)

	# Load the hflights package
	library(hflights)

	# Call both head() and summary() on hflights
	head(hflights)
	summary(hflights)


	# Convert the hflights data.frame into a hflights tbl
	hflights <- tbl_df(hflights)

	# Display the hflights tbl
	hflights

	# Create the object carriers, containing only the UniqueCarrier variable of hflights
	carriers <- hflights$UniqueCarrier


	# Use lut to translate the UniqueCarrier column of hflights and before doing so
	# glimpse hflights to see the UniqueCarrier variablle
	glimpse(hflights)

	lut <- c("AA" = "American", "AS" = "Alaska", "B6" = "JetBlue", "CO" = "Continental",
	"DL" = "Delta", "OO" = "SkyWest", "UA" = "United", "US" = "US_Airways",
	"WN" = "Southwest", "EV" = "Atlantic_Southeast", "F9" = "Frontier",
	"FL" = "AirTran", "MQ" = "American_Eagle", "XE" = "ExpressJet", "YV" = "Mesa")
	hflights$UniqueCarrier <- lut[hflights$UniqueCarrier]
	# Now glimpse hflights to see the change in the UniqueCarrier variable
	glimpse(hflights)

	# Fill up empty entries of CancellationCode with 'E'
	# To do so, first index the empty entries in CancellationCode
	cancellationEmpty <- hflights$CancellationCode == ""
	# Assign 'E' to the empty entries
	hflights$CancellationCode[cancellationEmpty] <- 'E'

	# Use a new lookup table to create a vector of code labels. Assign the vector to the CancellationCode column of hflights
	lut = c('A' = 'carrier', 'B' = 'weather', 'C' = 'FFA', 'D' = 'security', 'E' = 'not cancelled')
	hflights$CancellationCode <- lut[hflights$CancellationCode]

	# Inspect the resulting raw values of your variables
	glimpse(hflights)

view raw introduction.R hosted with ❤ by GitHub

Select and mutate
dplyr provides grammar for data manipulation apart from providing data structure. The grammar is built around 5 functions (also referred to as verbs) that do the basic tasks of data manipulation.

The 5 verbs of dplyr
select – removes columns from a dataset
filter – removes rows from a dataset
arrange – reorders rows in a dataset
mutate – uses the data to build new columns and values
summarize – calculates summary statistics

dplyr functions do not change the dataset. They return a new copy of the dataset to use.

To answer the simple question whether flight delays tend to shrink or grow during a flight, we can safely discard a lot of the variables of each flight. To select only the ones that matter, we can use select()

	hflights[c('ActualElapsedTime','ArrDelay','DepDelay')]
	# Equivalently, using dplyr:
	select(hflights, ActualElapsedTime, ArrDelay, DepDelay)

	# Print out a tbl with the four columns of hflights related to delay
	select(hflights, ActualElapsedTime, AirTime, ArrDelay, DepDelay)

	# Print out hflights, nothing has changed!
	hflights

	# Print out the columns Origin up to Cancelled of hflights
	select(hflights, Origin:Cancelled)

	# Find the most concise way to select: columns Year up to and
	# including DayOfWeek, columns ArrDelay up to and including Diverted
	# Answer to last question: be concise!
	# You may want to examine the order of hflight's column names before you
	# begin with names()
	names(hflights)
	select(hflights, -(DepTime:AirTime))

view raw verbs01.R hosted with ❤ by GitHub

dplyr comes with a set of helper functions that can help you select variables. These functions find groups of variables to select, based on their names. Each of these works only when used inside of select()

starts_with(“X”): every name that starts with “X”
ends_with(“X”): every name that ends with “X”
contains(“X”): every name that contains “X”
matches(“X”): every name that matches “X”, where “X” can be a regular expression
num_range(“x”, 1:5): the variables named x01, x02, x03, x04 and x05
one_of(x): every name that appears in x, which should be a character vector

	# Helper functions used with dplyr

	# Print out a tbl containing just ArrDelay and DepDelay
	select(hflights, ArrDelay, DepDelay)
	# Use a combination of helper functions and variable names to print out
	# only the UniqueCarrier, FlightNum, TailNum, Cancelled, and CancellationCode
	# columns of hflights
	select(hflights, UniqueCarrier, FlightNum, contains("Tail"), contains("Cancel"))

	# Find the most concise way to return the following columns with select and its
	# helper functions: DepTime, ArrTime, ActualElapsedTime, AirTime, ArrDelay,
	# DepDelay. Use only helper functions
	select(hflights, ends_with("Time"), ends_with("Delay"))

view raw verbs02.R hosted with ❤ by GitHub

In order to appreciate the usefulness of dplyr, here are some comparisons between base R and dplyr

	# Some comparisons to basic R
	# both hflights and dplyr are available

	ex1r <- hflights[c("TaxiIn","TaxiOut","Distance")]
	ex1d <- select(hflights, TaxiIn, TaxiOut, Distance)

	ex2r <- hflights[c("Year","Month","DayOfWeek","DepTime","ArrTime")]
	ex2d <- select(hflights, Year:ArrTime, -DayofMonth)

	ex3r <- hflights[c("TailNum","TaxiIn","TaxiOut")]
	ex3d <- select(hflights, TailNum, contains("Taxi"))

view raw comparisons01.R hosted with ❤ by GitHub

mutate() is the second of the five data manipulation functions. mutate() creates new columns which are added to a copy of the dataset.

	# Add the new variable ActualGroundTime to a copy of hflights and save the result as g1.
	g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)

	# Add the new variable GroundTime to a g1. Save the result as g2.
	g2 <- mutate(g1, GroundTime = TaxiIn + TaxiOut)

	# Add the new variable AverageSpeed to g2. Save the result as g3.
	g3 <- mutate(g2, AverageSpeed = Distance / AirTime * 60)

	# Print out g3
	g3

view raw verbs03.r hosted with ❤ by GitHub

So far we have added variables to hflights one at a time, but we can also use mutate() to add multiple variables at once.

	# Add a second variable loss_percent to the dataset: m1
	m1 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_percent = ((ArrDelay - DepDelay)/DepDelay)*100)

	# mutate() allows you to use a new variable while creating a next variable in the same call
	# Copy and adapt the previous command to reduce redendancy: m2
	m2 <- mutate(hflights, loss = ArrDelay - DepDelay, loss_percent = (loss/DepDelay) * 100 )

	# Add the three variables as described in the third instruction: m3
	m3 <- mutate(hflights, TotalTaxi = TaxiIn + TaxiOut, ActualGroundTime = ActualElapsedTime - AirTime, Diff = TotalTaxi - ActualGroundTime)

view raw verbs04.r hosted with ❤ by GitHub

Discovering Python & R

— my journey as a worker bee in quant finance

Data Manipulation in R with dplyr – Part 3

Data Manipulation in R with dplyr – Part 1

Share this:

Share this: