# Scraping the Daily India Covid-19 Tracker for CSV Data

This is a very short post that will be very useful to help you quickly set up your COVID-19 datasets. I’m sharing code at the end of this post that scrapes through all CSV datasets made available by COVID19-India API.

Copy paste this standalone script into your R environment and get going!

There are 15+ CSV files on the India COVID-19 API website. raw_data3 is actually a live dataset and more can be expected in the days to come, which is why a script that automates the data sourcing comes in handy.  Snapshot of the file names and the data dimensions as of today, 100 days since the first case was recorded in the state of Kerala —

My own analysis of the data and predictions are work-in-progress, going into a Github repo. Execute the code below and get started analyzing the data and fighting COVID-19!

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
 rm(list = ls()) # Load relevant libraries ----------------------------------------------------- library(stringr) library(data.table) # ============================================================================= # COVID 19-India API: A volunteer-driven, crowdsourced database # for COVID-19 stats & patient tracing in India # ============================================================================= url <- # List out all CSV files to source -------------------------------------------- html <- paste(readLines(url), collapse="\n") pattern <- matched <- unlist(str_match_all(string = html, pattern = pattern)) # Downloading the Data -------------------------------------------------------- covid_datasets <- lapply(as.list(matched), fread) # Naming the data objects appropriately --------------------------------------- exclude_chars <- dataset_names <- substr(x = matched, start = 1 + nchar(exclude_chars), stop = nchar(matched)- nchar(".csv")) # assigning variable names for(i in seq_along(dataset_names)){ assign(dataset_names[i], covid_datasets[[i]]) }

# Linear Algebra behind the lm() function in R

This post comes out of the blue, nearly 2 years since my last one. I realize I’ve been lazy, so here’s hoping I move from an inertia of rest to that of motion, implying, regular and (hopefully) relevant posts. I also chanced upon some wisdom while scrolling through my Twitter feed:

This blog post in particular was meant to be a reminder to myself and other R users that the much used lm() function in R (for fitting linear models) can be replaced with some handy matrix operations to obtain regression coefficients, their standard errors and other goodness-of-fit stats printed out when summary() is called on an lm object.

Linear regression can be formulated mathematically as follows:
$\mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{\epsilon}$,
$\mathbf{\epsilon} \sim N(0, \sigma^2 \mathbf{I})$

$\mathbf{y}$ is the $\mathbf{n}\times \mathbf{1}$ outcome variable and $\mathbf{X}$ is the $\mathbf{n}\times \mathbf{(\mathbf{k}+1)}$ data matrix of independent predictor variables (including a vector of ones corresponding to the intercept). The ordinary least squares (OLS) estimate for the vector of coefficients $\mathbf{\beta}$ is:

$\hat{\mathbf{\beta}} = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{y}$

The covariance matrix can be obtained with some handy matrix operations:
$\textrm{Var}(\hat{\mathbf{\beta}}) = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \;\sigma^2 \mathbf{I} \; \mathbf{X} (\mathbf{X}^{\prime} \mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1}$
given that $\textrm{Var}(AX) = A \times \textrm{Var}X \times A^{\prime}; \textrm{Var}(\mathbf{y}) = \mathbf{\sigma^2}$

The standard errors of the coefficients are basically $\textrm{Diag}(\sqrt{\textrm{Var}(\hat{\mathbf{\beta}})}) = \textrm{Diag}(\sqrt{\sigma^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1}})$ and with these, one can compute the t-statistics and their corresponding p-values.

Lastly, the F-statistic and its corresponding p-value can be calculated after computing the two residual sum of squares (RSS) statistics:

• $\mathbf{RSS}$ – for the full model with all predictors
• $\mathbf{RSS_0}$ – for the partial model ($\mathbf{y} = \mathbf{\mu} + \mathbf{\nu}; \mathbf{\mu} = \mathop{\mathbb{E}}[\mathbf{y}]; \mathbf{\nu} \sim N(0, \sigma_0^2 \mathbf{I})$) with the outcome observed mean as estimated outcome

$\mathbf{F} = \frac{(\mathbf{RSS_0}-\mathbf{RSS})/\mathbf{k}}{\mathbf{RSS}/(\mathbf{n}-\mathbf{k}-1)}$

I wrote some R code to construct the output from summarizing lm objects, using all the math spewed thus far. The data used for this exercise is available in R, and comprises of standardized fertility measures and socio-economic indicators for each of 47 French-speaking provinces of Switzerland from 1888. Try it out and see for yourself the linear algebra behind linear regression.

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters

# Quick Way of Installing all your old R libraries on a New Device

I recently bought a new laptop and began installing essential software all over again, including R of course! And I wanted all the libraries that I had installed in my previous laptop. Instead of installing libraries one by one all over again, I did the following:

Step 1: Save a list of packages installed in your old computing device (from your old device).

 installed <- as.data.frame(installed.packages()) write.csv(installed, 'installed_previously.csv') 

This saves information on installed packages in a csv file named installed_previously.csv. Now copy or e-mail this file to your new device and access it from your working directory in R.

Step 2: Create a list of libraries from your old list that were not already installed when you freshly download R (from your new device).

 installedPreviously <- read.csv('installed_previously.csv') baseR <- as.data.frame(installed.packages()) toInstall <- setdiff(installedPreviously, baseR) 

We now have a list of libraries that were installed in your previous computer in addition to the R packages already installed when you download R. So you now go ahead and install these libraries.

Step 3: Download this list of libraries.

 install.packages(toInstall) 

That’s it. Save yourself the trouble installing packages one-by-one all over again.

# Endogenously Detecting Structural Breaks in a Time Series: Implementation in R

The most conventional approach to determine structural breaks in longitudinal data seems to be the Chow Test.

From Wikipedia,

The Chow test, proposed by econometrician Gregory Chow in 1960, is a test of whether the coefficients in two linear regressions on different data sets are equal. In econometrics, it is most commonly used in time series analysis to test for the presence of a structural break at a period which can be assumed to be known a priori (for instance, a major historical event such as a war). In program evaluation, the Chow test is often used to determine whether the independent variables have different impacts on different subgroups of the population.

As shown in the figure below, regressions on the 2 sub-intervals seem to have greater explanatory power than a single regression over the data.

For the data above, determining the sub-intervals is an easy task. However, things may not look that simple in reality. Conducting a Chow test for structural breaks leaves the data scientist at the mercy of his subjective gaze in choosing a null hypothesis for a break point in the data.

Instead of choosing the breakpoints in an exogenous manner, what if the data itself could learn where these breakpoints lie? Such an endogenous technique is what Bai and Perron came up with in a seminal paper published in 1998 that could detect multiple structural breaks in longitudinal data. A later paper in 2003 dealt with the testing for breaks empirically, using a dynamic programming algorithm based on the Bellman principle.

I will discuss a quick implementation of this technique in R.

Brief Outline:

Assuming you have a ts object (I don’t know whether this works with zoo, but it should) in R, called ts. Then implement the following:

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
 # assuming you have a 'ts' object in R # 1. install package 'strucchange' # 2. Then write down this code: library(strucchange) # store the breakdates bp_ts <- breakpoints(ts ~ 1) # this will give you the break dates and their confidence intervals summary(bp_ts) # store the confidence intervals ci_ts <- confint(bp_ts) ## to plot the breakpoints with confidence intervals plot(ts) lines(bp_ts) lines(ci_ts)

An illustration

I started with data on India’s rice crop productivity between 1950 (around Independence from British Colonial rule) and 2008. Here’s how it looks:

You can download the excel and CSV files here and here respectively.

Here’s the way to go using R:

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
 library(xlsx) library(forecast) library(tseries) library(strucchange) ## load the data from a CSV or Excel file. This example is done with an Excel sheet. prod_df <- read.xlsx(file = 'agricultural_productivity.xls', sheetIndex = 'Sheet1', rowIndex = 8:65, colIndex = 2, header = FALSE) colnames(prod_df) <- c('Rice') ## store rice data as time series objects rice <- ts(prod_df\$Rice, start=c(1951, 1), end=c(2008, 1), frequency=1) # store the breakpoints bp.rice <- breakpoints(rice ~ 1) summary(bp.rice) ## the BIC chooses 5 breakpoints; plot the graph with breakdates and their confidence intervals plot(bp.rice) plot(rice) lines(bp.rice) ## confidence intervals ci.rice <- confint(bp.rice) ci.rice lines(ci.rice)

Voila, this is what you get:

The dotted vertical lines indicated the break dates; the horizontal red lines indicate their confidence intervals.

This is a quick and dirty implementation. For a more detailed take, check out the documentation on the R package called strucchange.

# Abu Mostafa’s Machine Learning MOOC – Now on EdX

This was in the pipeline for quite some time now. I have been waiting for his lectures on a platform such as EdX or Coursera, and the day has arrived. You can enroll and start with week 1’s lectures as they’re live now.

This course is taught by none other than Dr. Yaser S. Abu – Mostafa, whose textbook on machine learning, Learning from Data is #1 bestseller textbook (Amazon) in all categories of Computer Science. His online course has been offered earlier over here.

Teaching

Dr. Abu-Mostafa received the Clauser Prize for the most original doctoral thesis at Caltech. He received the ASCIT Teaching Awards in 1986, 1989 and 1991, the GSC Teaching Awards in 1995 and 2002, and the Richard P. Feynman prize for excellence in teaching in 1996.

Live ‘One-take’ Recordings

The lectures have been recorded from a live broadcast (including Q&A, which will let you gauge the level of CalTech students taking this course). In fact, it almost seems as though Abu Mostafa takes a direct jab at Andrew Ng’s popular Coursera MOOC by stating the obvious on his course page.

A real Caltech course, not a watered-down version

Again, while enrolling note that this is what Abu Mostafa had to say about the online course:  “A Caltech course does not cater to short attention spans, and it may not provide instant gratification…[like] many MOOCs out there that are quite simple and have a ‘video game’ feel to them.” Unsurprisingly, many online students have dropped out in the past, but some of those students who “complained early on but decided to stick with the course had very flattering words to say at the end”.

Prerequisites

• Basic probability
• Basic matrices
• Basic calculus
• Some programming language/platform (I choose Python!)

If you’re looking for a challenging machine learning course, this is probably one you must take.

# Implementing Undirected Graphs in Python

There are 2 popular ways of representing an undirected graph.

Each list describes the set of neighbors of a vertex in the graph.

The elements of the matrix indicate whether pairs of vertices are adjacent or not in the graph.

Here’s an implementation of the above in Python:

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters

Output:

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
 {} {} ["A:['B', 'C', 'E']", "C:['A', 'B', 'D', 'E']", "B:['A', 'C', 'D']", "E:['A', 'C']", "D:['B', 'C']"] [[ 0. 1. 1. 0. 1.] [ 1. 0. 1. 1. 0.] [ 1. 1. 0. 1. 1.] [ 0. 1. 1. 0. 0.] [ 1. 0. 1. 0. 0.]]

# Deterministic Selection Algorithm Python Code

Through this post, I’m sharing Python code implementing the median of medians algorithm, an algorithm that resembles quickselect, differing only in the way in which the pivot is chosen, i.e, deterministically, instead of at random.

Its best case complexity is O(n) and worst case complexity O(nlog2n)

I don’t have a formal education in CS, and came across this algorithm while going through Tim Roughgarden’s Coursera MOOC on the design and analysis of algorithms. Check out my implementation in Python.

This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters. Learn more about bidirectional Unicode characters
 def merge_tuple(a,b): """ Function to merge two arrays of tuples """ c = [] while len(a) != 0 and len(b) != 0: if a[0][0] < b[0][0]: c.append(a[0]) a.remove(a[0]) else: c.append(b[0]) b.remove(b[0]) if len(a) == 0: c += b else: c += a return c def mergesort_tuple(x): """ Function to sort an array using merge sort algorithm """ if len(x) == 0 or len(x) == 1: return x else: middle = len(x)/2 a = mergesort_tuple(x[:middle]) b = mergesort_tuple(x[middle:]) return merge_tuple(a,b) def lol(x,k): """ Function to divide a list into a list of lists of size k each. """ return [x[i:i+k] for i in range(0,len(x),k)] def preprocess(x): """ Function to assign an index to each element of a list of integers, outputting a list of tuples""" return zip(x,range(len(x))) def partition(x, pivot_index = 0): """ Function to partition an unsorted array around a pivot""" i = 0 if pivot_index !=0: x[0],x[pivot_index] = x[pivot_index],x[0] for j in range(len(x)-1): if x[j+1] < x[0]: x[j+1],x[i+1] = x[i+1],x[j+1] i += 1 x[0],x[i] = x[i],x[0] return x,i def ChoosePivot(x): """ Function to choose pivot element of an unsorted array using 'Median of Medians' method. """ if len(x) <= 5: return mergesort_tuple(x)[middle_index(x)] else: lst = lol(x,5) lst = [mergesort_tuple(el) for el in lst] C = [el[middle_index(el)] for el in lst] return ChoosePivot(C) def DSelect(x,k): """ Function to """ if len(x) == 1: return x[0] else: xpart = partition(x,ChoosePivot(preprocess(x))[1]) x = xpart[0] # partitioned array j = xpart[1] # pivot index if j == k: return x[j] elif j > k: return DSelect(x[:j],k) else: k = k - j - 1 return DSelect(x[(j+1):], k) arr = range(100,0,-1) print DSelect(arr,50) %timeit DSelect(arr,50)
view raw DSelect.py hosted with ❤ by GitHub

I get the following output:

51
100 loops, best of 3: 2.38 ms per loop

Note that on the same input, quickselect is faster, giving us:

1000 loops, best of 3: 254 µs per loop

# scikit-learn Linear Regression Example

Here’s a quick example case for implementing one of the simplest of learning algorithms in any machine learning toolbox – Linear Regression. You can download the IPython / Jupyter notebook here so as to play around with the code and try things out yourself.

I’m doing a series of posts on scikit-learn. Its documentation is vast, so unless you’re willing to search for a needle in a haystack, you’re better off NOT jumping into the documentation right away. Instead, knowing chunks of code that do the job might help.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.