# Scraping the Daily India Covid-19 Tracker for CSV Data

This is a very short post that should help you quickly set up your COVID-19 datasets. I’m sharing code at the end of this post that scrapes all CSV datasets made available by the COVID19-India API.

Copy-paste this standalone script into your R environment and get going!

There are 15+ CSV files on the India COVID-19 API website. raw_data3 is a live dataset, and more files can be expected in the days to come, which is why a script that automates the data sourcing comes in handy. Here is a snapshot of the file names and the data dimensions as of today, 100 days since the first case was recorded in the state of Kerala:

My own analysis of the data and predictions is a work in progress, going into a GitHub repo. Execute the code below and get started analyzing the data and fighting COVID-19!

```r
rm(list = ls())

# Load relevant libraries -----------------------------------------------------
library(stringr)
library(data.table)

# =============================================================================
# COVID 19-India API: A volunteer-driven, crowdsourced database
# for COVID-19 stats & patient tracing in India
# =============================================================================
url <-

# List out all CSV files to source --------------------------------------------
html <- paste(readLines(url), collapse = "\n")
pattern <-
matched <- unlist(str_match_all(string = html, pattern = pattern))

# Downloading the Data ---------------------------------------------------------
covid_datasets <- lapply(as.list(matched), fread)

# Naming the data objects appropriately ----------------------------------------
exclude_chars <-
dataset_names <- substr(x = matched,
                        start = 1 + nchar(exclude_chars),
                        stop = nchar(matched) - nchar(".csv"))

# assigning variable names
for(i in seq_along(dataset_names)){
  assign(dataset_names[i], covid_datasets[[i]])
}
```
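The values of `url`, `pattern`, and `exclude_chars` are missing from the listing above. As a hypothetical illustration of the link-extraction step only (the HTML string, URL, and file names here are invented, and base R's `regmatches()`/`gregexpr()` stand in for `stringr::str_match_all()`):

```r
# Hypothetical HTML listing; the real page's URL and regex are not shown above
html <- '<a href="https://api.example.org/csv/raw_data1.csv">raw_data1</a>
<a href="https://api.example.org/csv/state_wise.csv">state_wise</a>'

# Extract all .csv links (base R stand-in for stringr::str_match_all)
pattern <- 'https://[A-Za-z0-9./_]+\\.csv'
matched <- unlist(regmatches(html, gregexpr(pattern, html)))

# Derive object names by stripping the common URL prefix and the ".csv" suffix
exclude_chars <- "https://api.example.org/csv/"
dataset_names <- substr(matched,
                        start = 1 + nchar(exclude_chars),
                        stop  = nchar(matched) - nchar(".csv"))
dataset_names
```

With real inputs, each name in `dataset_names` is then bound to its downloaded dataset via `assign()`, exactly as in the script above.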

# Linear Algebra behind the lm() function in R

This post comes out of the blue, nearly 2 years since my last one. I realize I’ve been lazy, so here’s hoping I move from an inertia of rest to one of motion, meaning regular and (hopefully) relevant posts. I also chanced upon some wisdom while scrolling through my Twitter feed:

This blog post in particular was meant to be a reminder to myself and other R users that the much used lm() function in R (for fitting linear models) can be replaced with some handy matrix operations to obtain regression coefficients, their standard errors and other goodness-of-fit stats printed out when summary() is called on an lm object.

Linear regression can be formulated mathematically as follows:
$\mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{\epsilon}$,
$\mathbf{\epsilon} \sim N(0, \sigma^2 \mathbf{I})$

$\mathbf{y}$ is the $\mathbf{n}\times \mathbf{1}$ outcome variable and $\mathbf{X}$ is the $\mathbf{n}\times \mathbf{(\mathbf{k}+1)}$ data matrix of independent predictor variables (including a vector of ones corresponding to the intercept). The ordinary least squares (OLS) estimate for the vector of coefficients $\mathbf{\beta}$ is:

$\hat{\mathbf{\beta}} = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{y}$

The covariance matrix can be obtained with some handy matrix operations:
$\textrm{Var}(\hat{\mathbf{\beta}}) = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \;\sigma^2 \mathbf{I} \; \mathbf{X} (\mathbf{X}^{\prime} \mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1}$
given that $\textrm{Var}(A\mathbf{y}) = A \, \textrm{Var}(\mathbf{y}) \, A^{\prime}$ and $\textrm{Var}(\mathbf{y}) = \sigma^2 \mathbf{I}$

The standard errors of the coefficients are the square roots of the diagonal entries of the covariance matrix, $\sqrt{\textrm{Diag}(\sigma^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1})}$, and with these, one can compute the t-statistics and their corresponding p-values.

Lastly, the F-statistic and its corresponding p-value can be calculated after computing the two residual sum of squares (RSS) statistics:

• $\mathbf{RSS}$ – for the full model with all predictors
• $\mathbf{RSS_0}$ – for the partial model ($\mathbf{y} = \mathbf{\mu} + \mathbf{\nu}; \mathbf{\mu} = \mathop{\mathbb{E}}[\mathbf{y}]; \mathbf{\nu} \sim N(0, \sigma_0^2 \mathbf{I})$), which uses the observed mean of the outcome as the estimated outcome

$\mathbf{F} = \frac{(\mathbf{RSS_0}-\mathbf{RSS})/\mathbf{k}}{\mathbf{RSS}/(\mathbf{n}-\mathbf{k}-1)}$

I wrote some R code to construct the output from summarizing lm objects, using all the math spewed thus far. The data used for this exercise is available in R, and comprises standardized fertility measures and socio-economic indicators for each of 47 French-speaking provinces of Switzerland from 1888. Try it out and see for yourself the linear algebra behind linear regression.
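As a minimal sketch of the math above, the coefficient table and F-statistic can be reproduced with matrix operations alone, here using R's built-in swiss dataset (regressing Fertility on all other columns) and comparing against lm():

```r
# Design matrix X (with an intercept column of ones) and outcome y
y <- swiss$Fertility
X <- cbind(Intercept = 1, as.matrix(swiss[, -1]))  # n x (k+1)
n <- nrow(X); k <- ncol(X) - 1

# OLS coefficients: (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# Residuals and the unbiased estimate of sigma^2
res <- y - X %*% beta_hat
sigma2_hat <- sum(res^2) / (n - k - 1)

# Standard errors: square roots of the diagonal of sigma^2 (X'X)^{-1}
se <- sqrt(diag(sigma2_hat * solve(t(X) %*% X)))

# t-statistics and two-sided p-values
t_stat <- beta_hat / se
p_val <- 2 * pt(abs(t_stat), df = n - k - 1, lower.tail = FALSE)

# F-statistic against the intercept-only (observed-mean) model
rss  <- sum(res^2)
rss0 <- sum((y - mean(y))^2)
f_stat <- ((rss0 - rss) / k) / (rss / (n - k - 1))

# Compare with what lm() reports
fit <- lm(Fertility ~ ., data = swiss)
all.equal(as.vector(beta_hat), unname(coef(fit)))
```

The coefficients, standard errors, t-statistics, p-values and F-statistic computed this way should agree with `summary(fit)` up to floating-point tolerance.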


# Quick Way of Installing all your old R libraries on a New Device

I recently bought a new laptop and began installing essential software all over again, including R of course! I wanted all the libraries that I had installed on my previous laptop, but instead of installing them one by one all over again, I did the following:

Step 1: Save a list of packages installed in your old computing device (from your old device).

```r
installed <- as.data.frame(installed.packages())
write.csv(installed, 'installed_previously.csv')
```

This saves information on installed packages in a CSV file named installed_previously.csv. Now copy or e-mail this file to your new device and access it from your working directory in R.

```r
installedPreviously <- read.csv('installed_previously.csv')
baseR <- as.data.frame(installed.packages())
# compare package names, not whole data frames
toInstall <- setdiff(installedPreviously$Package, baseR$Package)
```

We now have a list of the libraries that were installed on your previous computer, over and above the packages that already come with a fresh installation of R. Now go ahead and install these libraries.

```r
install.packages(toInstall)
```

That’s it. Save yourself the trouble of installing packages one by one all over again.

# Endogenously Detecting Structural Breaks in a Time Series: Implementation in R

The most conventional approach to determine structural breaks in longitudinal data seems to be the Chow Test.

From Wikipedia,

The Chow test, proposed by econometrician Gregory Chow in 1960, is a test of whether the coefficients in two linear regressions on different data sets are equal. In econometrics, it is most commonly used in time series analysis to test for the presence of a structural break at a period which can be assumed to be known a priori (for instance, a major historical event such as a war). In program evaluation, the Chow test is often used to determine whether the independent variables have different impacts on different subgroups of the population.

As shown in the figure below, regressions on the 2 sub-intervals seem to have greater explanatory power than a single regression over the data.

For the data above, determining the sub-intervals is an easy task. However, things may not look that simple in reality. Conducting a Chow test for structural breaks leaves the data scientist at the mercy of subjective judgement when choosing a null hypothesis for a break point in the data.

Instead of choosing the breakpoints in an exogenous manner, what if the data itself could learn where these breakpoints lie? Such an endogenous technique is what Bai and Perron came up with in a seminal paper published in 1998 that could detect multiple structural breaks in longitudinal data. A later paper in 2003 dealt with the testing for breaks empirically, using a dynamic programming algorithm based on the Bellman principle.

I will discuss a quick implementation of this technique in R.

Brief Outline:

Assume you have a ts object in R called ts (I don’t know whether this works with zoo, but it should). Then implement the following:

```r
# assuming you have a 'ts' object in R
# 1. install package 'strucchange'
# 2. Then write down this code:

library(strucchange)

# store the breakdates
bp_ts <- breakpoints(ts ~ 1)

# this will give you the break dates and their confidence intervals
summary(bp_ts)

# store the confidence intervals
ci_ts <- confint(bp_ts)

## to plot the breakpoints with confidence intervals
plot(ts)
lines(bp_ts)
lines(ci_ts)
```

An illustration

I started with data on India’s rice crop productivity between 1950 (around Independence from British Colonial rule) and 2008. Here’s how it looks:

You can download the excel and CSV files here and here respectively.

Here’s the way to go using R:

```r
library(xlsx)
library(forecast)
library(tseries)
library(strucchange)

## load the data from a CSV or Excel file. This example is done with an Excel sheet.
prod_df <- read.xlsx(file = 'agricultural_productivity.xls', sheetName = 'Sheet1',
                     rowIndex = 8:65, colIndex = 2, header = FALSE)
colnames(prod_df) <- c('Rice')

## store rice data as a time series object
rice <- ts(prod_df$Rice, start = c(1951, 1), end = c(2008, 1), frequency = 1)

# store the breakpoints
bp.rice <- breakpoints(rice ~ 1)
summary(bp.rice)

## the BIC chooses 5 breakpoints; plot the graph with breakdates and their
## confidence intervals
plot(bp.rice)
plot(rice)
lines(bp.rice)

## confidence intervals
ci.rice <- confint(bp.rice)
ci.rice
lines(ci.rice)
```

Voila, this is what you get: the dotted vertical lines indicate the break dates, and the horizontal red lines indicate their confidence intervals.

This is a quick and dirty implementation. For a more detailed take, check out the documentation of the R package strucchange.

# MITx 15.071x (Analytics Edge) – 2016

I am currently auditing this course and just completed its 2nd assignment. It’s probably one of the best courses out there to learn R in a way that goes beyond the syntax, with an objective in mind: to do analytics and run machine learning algorithms to derive insight from data. This course differs from machine learning courses by, say, Andrew Ng in that it won’t focus on coding the algorithms; rather, it emphasizes diving right into the implementation of those algorithms using libraries that the R programming language already equips us with.

Take a look at the course logistics. And hey, they’ve got a Kaggle competition! There’s still time to enroll and grab a certificate (or simply audit). The course is offered once a year. I met a bunch of people who did well at a data hackathon I attended recently, who had learned the ropes in data science thanks to Analytics Edge.
# Detecting Structural Breaks in China’s FX Regime

##### Edit: This post is in its infancy. Work is still ongoing as far as deriving insight from the data is concerned. More content and economic insight is expected to be added to this post as and when progress is made in that direction.

This is an attempt to detect structural breaks in China’s FX regime using the Frankel–Wei regression methodology (later improved upon by Bai and Perron). I came up with the motivation to check for these structural breaks while attending a guest lecture on FX regimes by Dr. Ajay Shah delivered at IGIDR. This is work that I and two other classmates are doing as a term paper project under the supervision of Dr. Rajeswari Sengupta.

The code below can be replicated and run as is, to get the same results.

```r
## if fxregime or strucchange package is absent from installed packages,
## download it and load it
if(!require('fxregime')){
  install.packages("fxregime")
}
if(!require('strucchange')){
  install.packages("strucchange")
}

## load packages
library("fxregime")
library('strucchange')

# load the necessary data related to exchange rates - 'FXRatesCHF'
# this dataset treats CHF as unit currency
data("FXRatesCHF", package = "fxregime")

## compute returns for CNY (and explanatory currencies)
## since China abolished fixed USD regime
cny <- fxreturns("CNY", frequency = "daily",
                 start = as.Date("2005-07-25"), end = as.Date("2010-02-12"),
                 other = c("USD", "JPY", "EUR", "GBP"))

## compute all segmented regressions with minimal segment size of
## h = 100 and maximal number of breaks = 10
regx <- fxregimes(CNY ~ USD + JPY + EUR + GBP,
                  data = cny, h = 100, breaks = 10, ic = "BIC")

## Print summary of regression results
summary(regx)

## minimum BIC is attained for 2-segment (1-break) model
plot(regx)
round(coef(regx), digits = 3)
sqrt(coef(regx)[, "(Variance)"])

## inspect associated confidence intervals
cit <- confint(regx, level = 0.9)
cit
breakdates(cit)

## plot LM statistics along with confidence interval
flm <- fxlm(CNY ~ USD + JPY + EUR + GBP, data = cny)
scus <- gefp(flm, fit = NULL)
plot(scus, functional = supLM(0.1))

## add lines related to breaks to your plot
lines(cit)
```

As can be seen in the figure below, the structural breaks correspond to the vertical bars. We are still working on understanding the motivations of China’s central bank in varying the degree of the managed float exchange rate.

EDIT (May 16, 2016): The code above uses data provided by the package itself. If you wish to replicate this analysis on data after 2010, you will have to use your own data. We used Quandl, which lets you access 10 premium datasets for free. An API key (for only 10 calls on premium datasets) is provided if you register there. Foreign exchange rate data (2000 onward till date), apparently, is premium data. You can find these here.

Here are the (partial) results and code to work the same methodology on the data from 2010 to 2016:

```r
## if fxregime is absent from installed packages, download it and load it
if(!require('fxregime')){
  install.packages("fxregime")
}
## load package
library("fxregime")

# load the necessary data related to exchange rates - 'FXRatesCHF'
# this dataset treats CHF as unit currency

# install / load Quandl
if(!require('Quandl')){
  install.packages("Quandl")
}
library(Quandl)

# Extract and load currency data series with respect to CHF from Quandl.
# Each Quandl user will have a unique api_key upon signing up.
# The freemium version allows access to up to 10 fx rate data sets.
# USDCHF <- Quandl("CUR/CHF", api_key="p2GsFxccPGFSw7n1-NF9")
# write.csv(USDCHF, file = "USDCHF.csv")
# USDCNY <- Quandl("CUR/CNY", api_key="p2GsFxccPGFSw7n1-NF9")
# write.csv(USDCNY, file = "USDCNY.csv")
# USDJPY <- Quandl("CUR/JPY", api_key="p2GsFxccPGFSw7n1-NF9")
# write.csv(USDJPY, file = "USDJPY.csv")
# USDEUR <- Quandl("CUR/EUR", api_key="p2GsFxccPGFSw7n1-NF9")
# write.csv(USDEUR, file = "USDEUR.csv")
# USDGBP <- Quandl("CUR/GBP", api_key="p2GsFxccPGFSw7n1-NF9")
# write.csv(USDGBP, file = "USDGBP.csv")

# load the data sets into R
USDCHF <- read.csv("G:/China's Economic Woes/USDCHF.csv")
USDCHF <- USDCHF[, 2:3]
USDCNY <- read.csv("G:/China's Economic Woes/USDCNY.csv")
USDCNY <- USDCNY[, 2:3]
USDEUR <- read.csv("G:/China's Economic Woes/USDEUR.csv")
USDEUR <- USDEUR[, 2:3]
USDGBP <- read.csv("G:/China's Economic Woes/USDGBP.csv")
USDGBP <- USDGBP[, 2:3]
USDJPY <- read.csv("G:/China's Economic Woes/USDJPY.csv")
USDJPY <- USDJPY[, 2:3]

start <- 1     # corresponds to 2016-05-12
end <- 2272    # corresponds to 2010-02-12
dates <- as.Date(USDCHF[start:end, 1])
USD <- 1/USDCHF[start:end, 2]
CNY <- USDCNY[start:end, 2]/USD
JPY <- USDJPY[start:end, 2]/USD
EUR <- USDEUR[start:end, 2]/USD
GBP <- USDGBP[start:end, 2]/USD

# reverse the order of the vectors so that dates run forward
# instead of the other way around
USD <- USD[length(USD):1]
CNY <- CNY[length(CNY):1]
JPY <- JPY[length(JPY):1]
EUR <- EUR[length(EUR):1]
GBP <- GBP[length(GBP):1]
dates <- dates[length(dates):1]

df <- data.frame(CNY, USD, JPY, EUR, GBP)
df$weekday <- weekdays(dates)
row.names(df) <- dates
df <- subset(df, weekday != 'Sunday')
df <- subset(df, weekday != 'Saturday')
df <- df[, 1:5]
zoo_df <- as.zoo(df)

# Code to replicate analysis
cny_rep <- fxreturns("CNY", data = zoo_df, frequency = "daily",
                     other = c("USD", "JPY", "EUR", "GBP"))
time(cny_rep) <- as.Date(row.names(df)[2:1627])
regx_rep <- fxregimes(CNY ~ USD + JPY + EUR + GBP,
                      data = cny_rep, h = 100, breaks = 10, ic = "BIC")
summary(regx_rep)

## minimum BIC is attained for 2-segment (5-break) model
plot(regx_rep)
round(coef(regx_rep), digits = 3)
sqrt(coef(regx_rep)[, "(Variance)"])

## inspect associated confidence intervals
cit_rep <- confint(regx_rep, level = 0.9)
breakdates(cit_rep)

## plot LM statistics along with confidence interval
flm_rep <- fxlm(CNY ~ USD + JPY + EUR + GBP, data = cny_rep)
scus_rep <- gefp(flm_rep, fit = NULL)
plot(scus_rep, functional = supLM(0.1))

## add lines related to breaks to your plot
lines(cit_rep)
apply(cny_rep, 1, function(x) sum(is.na(x)))
```

We got breaks in 2010 and in 2015 (when China’s stock markets crashed). We would have hoped for more breaks (we can still get them), but that would depend on the parameters chosen for our regression.

# Data Manipulation in R with dplyr – Part 3

This happens to be my 50th blog post – and my blog is 8 months old.

🙂

This post is the third and last in a series of posts (Part 1, Part 2) on data manipulation with dplyr. Note that the objects in the code may have been defined in earlier posts, and the code in this post continues from the code in those posts.

Although datasets can be manipulated in sophisticated ways by linking the 5 verbs of dplyr together, doing so can be a bit verbose.

Creating multiple objects, especially when working on a large dataset can slow you down in your analysis. Chaining functions directly together into one line of code is difficult to read. This is sometimes called the Dagwood sandwich problem: you have too much filling (too many long arguments) between your slices of bread (parentheses). Functions and arguments get further and further apart.

The %>% operator allows you to extract the first argument of a function from the arguments list and put it in front of it, thus solving the Dagwood sandwich problem.
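As a toy sketch of this rewriting (assuming the magrittr package, which supplies %>% and is loaded by dplyr):

```r
library(magrittr)

# x %>% f(y) is rewritten as f(x, y): the left-hand side becomes
# the first argument of the function on the right
result_piped  <- c(4, 9, 25) %>% sqrt() %>% sum()
result_nested <- sum(sqrt(c(4, 9, 25)))

result_piped                            # 10
identical(result_piped, result_nested)  # TRUE
```

Reading left to right, the pipe keeps each step next to its arguments instead of burying them inside nested parentheses.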

```r
# %>% OPERATOR ----------------------------------------------------------------

# with %>% operator
hflights %>%
  mutate(diff = TaxiOut - TaxiIn) %>%
  filter(!is.na(diff)) %>%
  summarise(avg = mean(diff))

# without %>% operator
# arguments get further and further apart
summarize(filter(mutate(hflights, diff = TaxiOut - TaxiIn),
                 !is.na(diff)),
          avg = mean(diff))

# with %>% operator
d <- hflights %>%
  select(Dest, UniqueCarrier, Distance, ActualElapsedTime) %>%
  mutate(RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60)

# without %>% operator
d <- mutate(select(hflights, Dest, UniqueCarrier, Distance, ActualElapsedTime),
            RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60)

# Filter and summarise d
d %>%
  filter(!is.na(mph), mph < 70) %>%
  summarise(n_less = n(), n_dest = n_distinct(Dest),
            min_dist = min(Distance), max_dist = max(Distance))

# Let's define preferable flights as flights that are 150% faster than driving,
# i.e. that travel 105 mph or greater in real time. Also, assume that cancelled or
# diverted flights are less preferable than driving.

# ADVANCED PIPING EXERCISES
# Use one single piped call to print a summary with the following variables:
# n_non - the number of non-preferable flights in hflights,
# p_non - the percentage of non-preferable flights in hflights,
# n_dest - the number of destinations that non-preferable flights traveled to,
# min_dist - the minimum distance that non-preferable flights traveled,
# max_dist - the maximum distance that non-preferable flights traveled
hflights %>%
  mutate(RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60) %>%
  filter(mph < 105 | Cancelled == 1 | Diverted == 1) %>%
  summarise(n_non = n(), p_non = 100*n_non/nrow(hflights),
            n_dest = n_distinct(Dest), min_dist = min(Distance),
            max_dist = max(Distance))

# Use summarise() to create a summary of hflights with a single variable, n,
# that counts the number of overnight flights. These flights have an arrival
# time that is earlier than their departure time. Only include flights that have
# no NA values for both DepTime and ArrTime in your count.
hflights %>%
  mutate(overnight = (ArrTime < DepTime)) %>%
  filter(overnight == TRUE) %>%
  summarise(n = n())
```

group_by()

group_by() defines groups within a data set. Its influence becomes clear when calling summarise() on a grouped dataset. Summarizing statistics are calculated for the different groups separately.

```r
# group_by() -------------------------------------------------------------------

# Generate a per-carrier summary of hflights with the following variables: n_flights,
# the number of flights flown by the carrier; n_canc, the number of cancelled flights;
# p_canc, the percentage of cancelled flights; avg_delay, the average arrival delay of
# flights whose delay does not equal NA. Next, order the carriers in the summary from
# low to high by their average arrival delay. Use percentage of flights cancelled to
# break any ties. Which airline scores best based on these statistics?
hflights %>%
  group_by(UniqueCarrier) %>%
  summarise(n_flights = n(), n_canc = sum(Cancelled),
            p_canc = 100*n_canc/n_flights,
            avg_delay = mean(ArrDelay, na.rm = TRUE)) %>%
  arrange(avg_delay)

# Generate a per-day-of-week summary of hflights with the variable avg_taxi,
# the average total taxiing time. Pipe this summary into an arrange() call such
# that the day with the highest avg_taxi comes first.
hflights %>%
  group_by(DayOfWeek) %>%
  summarize(avg_taxi = mean(TaxiIn + TaxiOut, na.rm = TRUE)) %>%
  arrange(desc(avg_taxi))
```

Combine group_by with mutate

group_by() can also be combined with mutate(). When you mutate grouped data, mutate() will calculate the new variables independently for each group. This is particularly useful when mutate() uses the rank() function, which calculates within-group rankings. rank() takes a group of values and calculates the rank of each value within the group, e.g.

```r
rank(c(21, 22, 24, 23))
# [1] 1 2 4 3
```

As with arrange(), rank() ranks values from the smallest to the largest, and this behaviour can be reversed with the desc() function.
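To make the idea concrete, here is a small sketch combining group_by(), mutate() and rank() on a made-up data frame (the carrier and delay columns are hypothetical, standing in for the hflights data used elsewhere in this series):

```r
library(dplyr)

# a tiny made-up stand-in for hflights
flights <- data.frame(
  carrier = c("AA", "AA", "UA", "UA", "UA"),
  delay   = c(10, 25, 5, 40, 15)
)

# rank flights within each carrier, from most delayed to least delayed
ranked <- flights %>%
  group_by(carrier) %>%
  mutate(delay_rank = rank(desc(delay))) %>%
  arrange(carrier, delay_rank)

ranked  # each carrier's most-delayed flight gets delay_rank 1
```

Because the data is grouped by carrier, rank() restarts within each carrier rather than ranking across the whole data frame.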


# My First Data Science Hackathon

So after 8 months of playing around with R and Python and blog post after blog post, I found myself finally hacking away at a problem set from the 17th storey of the Hindustan Times building at Connaught Place. I had entered my first ever data science hackathon, conducted by Analytics Vidhya, a pioneer in analytics learning in India. Pizzas and Pepsi were on the house. Like any predictive analysis hackathon, this one accepted unlimited entries until submission time. It ran from 2pm to 4:30pm today, 2.5 hours, of which I ended up wasting 1.5 hours trying to make my first submission, which hit submission error after submission error until the problem was finally fixed post lunch. I had 1 hour to try my best. It wasn’t the best performance, but I thought I’d blog the experience anyway, as a reminder of the work that awaits me. I want to be the one winning prize money at the end of the day.

🙂

# Data Manipulation in R with dplyr – Part 2

Note that this post is in continuation with Part 1 of this series of posts on data manipulation with dplyr in R. The code in this post carries forward from the variables / objects defined in Part 1.

In the previous post, I talked about how dplyr provides a grammar of sorts to manipulate data, and consists of 5 verbs to do so:

The 5 verbs of dplyr
select – removes columns from a dataset
filter – removes rows from a dataset
arrange – reorders rows in a dataset
mutate – uses the data to build new columns and values
summarize – calculates summary statistics

I went on to discuss examples using select() and mutate(). Let’s now talk about filter(). R comes with a set of logical operators that you can use inside filter(). These operators are:
x < y, TRUE if x is less than y
x <= y, TRUE if x is less than or equal to y
x == y, TRUE if x equals y
x != y, TRUE if x does not equal y
x >= y, TRUE if x is greater than or equal to y
x > y, TRUE if x is greater than y
x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)

The following call, for example, filters df so that only the observations where the variable a is greater than the variable b are retained:
filter(df, a > b)

```r
# Print out all flights in hflights that traveled 3000 or more miles
filter(hflights, Distance >= 3000)

# All flights flown by one of JetBlue, Southwest, or Delta
filter(hflights, UniqueCarrier %in% c('JetBlue', 'Southwest', 'Delta'))

# All flights where taxiing took longer than flying
filter(hflights, TaxiIn + TaxiOut > AirTime)
```

Combining tests using boolean operators
R also comes with a set of boolean operators that you can use to combine multiple logical tests into a single test. These include & (and), | (or), and ! (not). Instead of using the & operator, you can also pass several logical tests to filter(), separated by commas. The following calls are equivalent:

filter(df, a > b & c > d)
filter(df, a > b, c > d)

The is.na() function will also come in handy very often. This expression, for example, keeps the observations in df for which the variable x is not NA:

filter(df, !is.na(x))

```r
# Combining tests using boolean operators

# All flights that departed before 5am or arrived after 10pm
filter(hflights, DepTime < 500 | ArrTime > 2200)

# All flights that departed late but arrived ahead of schedule
filter(hflights, DepDelay > 0 & ArrDelay < 0)

# All cancelled weekend flights
filter(hflights, DayOfWeek %in% c(6, 7) & Cancelled == 1)

# All flights that were cancelled after being delayed
filter(hflights, Cancelled == 1, DepDelay > 0)
```

A recap on select(), mutate() and filter():

```r
# Summarizing Exercise

# Select the flights that had JFK as their destination: c1
c1 <- filter(hflights, Dest == 'JFK')

# Combine the Year, Month and DayofMonth variables to create a Date column: c2
c2 <- mutate(c1, Date = paste(Year, Month, DayofMonth, sep = "-"))

# Print out a selection of columns of c2
select(c2, Date, DepTime, ArrTime, TailNum)

# How many weekend flights flew a distance of more than 1000 miles
# but had a total taxiing time below 15 minutes?
nrow(filter(hflights, DayOfWeek %in% c(6, 7), Distance > 1000,
            TaxiIn + TaxiOut < 15))
```

Arranging Data
arrange() can be used to rearrange rows according to any type of data. If you pass arrange() a character variable, R will rearrange the rows in alphabetical order according to values of the variable. If you pass a factor variable, R will rearrange the rows according to the order of the levels in your factor (running levels() on the variable reveals this order).

By default, arrange() arranges the rows from smallest to largest. Rows with the smallest value of the variable will appear at the top of the data set. You can reverse this behaviour with the desc() function. arrange() will reorder the rows from largest to smallest values of a variable if you wrap the variable name in desc() before passing it to arrange().

```r
# Definition of dtc
dtc <- filter(hflights, Cancelled == 1, !is.na(DepDelay))

# Arrange dtc by departure delays
arrange(dtc, DepDelay)

# Arrange dtc so that cancellation reasons are grouped
arrange(dtc, CancellationCode)

# Arrange dtc according to carrier and departure delays
arrange(dtc, UniqueCarrier, DepDelay)

# Arrange according to carrier and decreasing departure delays
arrange(hflights, UniqueCarrier, desc(DepDelay))

# Arrange flights by total delay (normal order)
arrange(hflights, DepDelay + ArrDelay)

# Keep flights leaving to DFW before 8am and arrange according to decreasing AirTime
arrange(filter(hflights, Dest == 'DFW', DepTime < 800), desc(AirTime))
```

Summarizing Data

summarise(), the last of the 5 verbs, follows the same syntax as mutate(), but the resulting dataset consists of a single row instead of an entire new column in the case of mutate().

In contrast to the four other data manipulation functions, summarise() does not return an altered copy of the dataset it is summarizing; instead, it builds a new dataset that contains only the summarizing statistics.

Note: summarise() and summarize() both work the same!

You can use any function you like in summarise(), so long as the function can take a vector of data and return a single number. R contains many aggregating functions. Here are some of the most useful:

min(x) – minimum value of vector x.
max(x) – maximum value of vector x.
mean(x) – mean value of vector x.
median(x) – median value of vector x.
quantile(x, p) – pth quantile of vector x.
sd(x) – standard deviation of vector x.
var(x) – variance of vector x.
IQR(x) – Inter Quartile Range (IQR) of vector x.
diff(range(x)) – total range of vector x.

```r
# Print out a summary with variables min_dist and max_dist
summarize(hflights, min_dist = min(Distance), max_dist = max(Distance))

# Print out a summary with variable max_div
summarize(filter(hflights, Diverted == 1), max_div = max(Distance))

# Remove rows that have NA ArrDelay: temp1
temp1 <- filter(hflights, !is.na(ArrDelay))

# Generate summary about ArrDelay column of temp1
summarise(temp1, earliest = min(ArrDelay), average = mean(ArrDelay),
          latest = max(ArrDelay), sd = sd(ArrDelay))

# Keep rows that have no NA TaxiIn and no NA TaxiOut: temp2
temp2 <- filter(hflights, !is.na(TaxiIn), !is.na(TaxiOut))

# Print the maximum taxiing difference of temp2 with summarise()
summarise(temp2, max_taxi_diff = max(abs(TaxiIn - TaxiOut)))
```

dplyr provides several helpful aggregate functions of its own, in addition to the ones that are already defined in R. These include:

first(x) – The first element of vector x.
last(x) – The last element of vector x.
nth(x, n) – The nth element of vector x.
n() – The number of rows in the data.frame or group of observations that summarise() describes.
n_distinct(x) – The number of unique values in vector x

```r
# Generate summarizing statistics for hflights
summarise(hflights, n_obs = n(), n_carrier = n_distinct(UniqueCarrier),
          n_dest = n_distinct(Dest), dest100 = nth(Dest, 100))

# Filter hflights to keep all American Airline flights: aa
aa <- filter(hflights, UniqueCarrier == "American")

# Generate summarizing statistics for aa
summarise(aa, n_flights = n(), n_canc = sum(Cancelled),
          p_canc = 100*(n_canc/n_flights),
          avg_delay = mean(ArrDelay, na.rm = TRUE))
```

That would be it for Part 2 of this series of posts on data manipulation with dplyr. Part 3 will focus on the pipe operator, group_by() and working with databases.