Linear Algebra behind the lm() function in R

This post comes out of the blue, nearly 2 years since my last one. I realize I’ve been lazy, so here’s hoping I move from an inertia of rest to that of motion, implying, regular and (hopefully) relevant posts. I also chanced upon some wisdom while scrolling through my Twitter feed:

This blog post in particular was meant to be a reminder to myself and other R users that the much used lm() function in R (for fitting linear models) can be replaced with some handy matrix operations to obtain regression coefficients, their standard errors and other goodness-of-fit stats printed out when summary() is called on an lm object.

Linear regression can be formulated mathematically as follows:
\mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{\epsilon} ,
\mathbf{\epsilon} \sim N(0, \sigma^2 \mathbf{I})

\mathbf{y} is the \mathbf{n}\times \mathbf{1} outcome variable and \mathbf{X} is the \mathbf{n}\times \mathbf{(\mathbf{k}+1)} data matrix of independent predictor variables (including a vector of ones corresponding to the intercept). The ordinary least squares (OLS) estimate for the vector of coefficients \mathbf{\beta} is:

\hat{\mathbf{\beta}} = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{y}

The covariance matrix can be obtained with some handy matrix operations:
\textrm{Var}(\hat{\mathbf{\beta}}) = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \;\sigma^2 \mathbf{I} \; \mathbf{X} (\mathbf{X}^{\prime} \mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1}
given that \textrm{Var}(AX) = A \times \textrm{Var}X \times A^{\prime}; \textrm{Var}(\mathbf{y}) = \mathbf{\sigma^2}

The standard errors of the coefficients are basically \textrm{Diag}(\sqrt{\textrm{Var}(\hat{\mathbf{\beta}})}) = \textrm{Diag}(\sqrt{\sigma^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1}}) and with these, one can compute the t-statistics and their corresponding p-values.

Lastly, the F-statistic and its corresponding p-value can be calculated after computing the two residual sum of squares (RSS) statistics:

  • \mathbf{RSS} – for the full model with all predictors
  • \mathbf{RSS_0} – for the partial model (\mathbf{y} = \mathbf{\mu} + \mathbf{\nu}; \mathbf{\mu} = \mathop{\mathbb{E}}[\mathbf{y}]; \mathbf{\nu} \sim N(0, \sigma_0^2 \mathbf{I}) ) with the outcome observed mean as estimated outcome

\mathbf{F} = \frac{(\mathbf{RSS_0}-\mathbf{RSS})/\mathbf{k}}{\mathbf{RSS}/(\mathbf{n}-\mathbf{k}-1)}

I wrote some R code to construct the output from summarizing lm objects, using all the math spewed thus far. The data used for this exercise is available in R, and comprises of standardized fertility measures and socio-economic indicators for each of 47 French-speaking provinces of Switzerland from 1888. Try it out and see for yourself the linear algebra behind linear regression.

Hope this was useful and worth your time!

Endogenously Detecting Structural Breaks in a Time Series: Implementation in R

The most conventional approach to determine structural breaks in longitudinal data seems to be the Chow Test.

From Wikipedia,

The Chow test, proposed by econometrician Gregory Chow in 1960, is a test of whether the coefficients in two linear regressions on different data sets are equal. In econometrics, it is most commonly used in time series analysis to test for the presence of a structural break at a period which can be assumed to be known a priori (for instance, a major historical event such as a war). In program evaluation, the Chow test is often used to determine whether the independent variables have different impacts on different subgroups of the population.

As shown in the figure below, regressions on the 2 sub-intervals seem to have greater explanatory power than a single regression over the data.

440px-chow_test_structural_break

For the data above, determining the sub-intervals is an easy task. However, things may not look that simple in reality. Conducting a Chow test for structural breaks leaves the data scientist at the mercy of his subjective gaze in choosing a null hypothesis for a break point in the data.

Instead of choosing the breakpoints in an exogenous manner, what if the data itself could learn where these breakpoints lie? Such an endogenous technique is what Bai and Perron came up with in a seminal paper published in 1998 that could detect multiple structural breaks in longitudinal data. A later paper in 2003 dealt with the testing for breaks empirically, using a dynamic programming algorithm based on the Bellman principle.

I will discuss a quick implementation of this technique in R.

Brief Outline:

Assuming you have a ts object (I don’t know whether this works with zoo, but it should) in R, called ts. Then implement the following:

An illustration 

I started with data on India’s rice crop productivity between 1950 (around Independence from British Colonial rule) and 2008. Here’s how it looks:

rice_productivity

You can download the excel and CSV files here and here respectively.

Here’s the way to go using R:

Voila, this is what you get:

02_rice_multiplebreaks

The dotted vertical lines indicated the break dates; the horizontal red lines indicate their confidence intervals.

This is a quick and dirty implementation. For a more detailed take, check out the documentation on the R package called strucchange.

Detecting Structural Breaks in China’s FX Regime

Edit: This post is in its infancy. Work is still ongoing as far as deriving insight from the data is concerned. More content and economic insight is expected to be added to this post as and when progress is made in that direction.

This is an attempt to detect structural breaks in China’s FX regime using Frenkel Wei regression methodology (this was later improved by Perron and Bai). I came up with the motivation to check for these structural breaks while attending a guest lecture on FX regimes by Dr. Ajay Shah delivered at IGIDR. This is work that I and two other classmates are working on as a term paper project under the supervision of Dr. Rajeswari Sengupta.

The code below can be replicated and run as is, to get same results.

As can be seen in the figure below, the structural breaks correspond to the vertical bars. We are still working on understanding the motivations of China’s central bank in varying the degree of the managed float exchange rate.

strucchange_china_2006_2010

EDIT (May 16, 2016):

The code above uses data provided by the package itself. If you wished to replicate this analysis on data after 2010, you will have to use your own data. We used Quandl, which lets you get 10 premium datasets for free. An API key (for only 10 calls on premium datasets) is provided if you register there. Foreign exchange rate data (2000 onward till date) apparently, is premium data. You can find these here.

Here are the (partial) results and code to work the same methodology on the data from 2010 to 2016:

20102016

We got breaks in 2010 and in 2015 (when China’s stock markets crashed). We would have hoped for more breaks (we can still get them), but that would depend on the parameters chosen for our regression.

 

Python to the Rescue

Another journal-like entry

Programming as a profession is only moderately interesting. It can be a good job, but you could make about the same money and be happier running a fast food joint. You’re much better off using code as your secret weapon in another profession.

People who can code in the world of technology companies are a dime a dozen and get no respect. People who can code in biology, medicine, government, sociology, physics, history, and mathematics are respected and can do amazing things to advance those disciplines.

Advice from an Old Programmer

I was reading a paper today, written by MIT’s Esther Duflo, part of a homework assignment on a MOOC on development policy (Foundations of Development Policy: Advanced Development Economics) offered by Duflo and Abhijit Banerjee. So I opened the paper and started copying important lines from the PDF to a text editor to make notes. I could copy the text, but when I pasted it onto a text editor, it turned out to be gibberish (you can try it too!).

For instance, instead of pasting

Between 1973 and 1978 the Indonesian Government constructed over 61,000 primary schools throughout the county

I got:

Ehwzhhq 4<:6 dqg 4<:;/ wkh Lqgrqhvldq Jryhuqphqw frqv wuxfwhg ryhu 94/333 sulpdu| vfkrrov wkurxjkrxw wkh frxqwu|

It was a good thing the cipher used for this text wasn’t too complicated. After some perusal, I found that ‘B’ became ‘E’, ‘e’ became ‘h’, ‘t’ became ‘w’ and so on. So I copied the entire content of the PDF to a text file and named the encrypted file estherDuflo.txt. I noticed that the encryption had been implemented only on the first 1475 lines. The remaining was plain English.

So I wrote a Python script to decrypt the gibberish, rather than simply typing out my notes. It took 20 minutes writing the code and 8 ms to execute (of course!). I didn’t want to spend a lot of time ensuring a thorough decryption, so the result wasn’t perfect, but then I’m going to make do. I named the decrypted file estherDufloDecrypted.txt.

Sample from the Encrypted File


My Code

Sample from the Decrypted File

Getting Started with R on MIT’s 14.74x (Foundations of Development Policy)

I noticed that a major grievance of many students enrolled in MIT‘s latest edX course on development policy (Foundations of Development Policy: Advanced Development Economics) was that there wasn’t enough done to get them going with the R assignments. I have posted the R code for the homework (past the deadline, of course) of the first 2 weeks, so that others get a hang of the level of R that might be needed to solve these assignments in the following weeks. I’m willing to help out those needing help getting up to speed with R required for this course. For specific queries, leave your message in the comments section.

A great place to get spend time learning R before taking Foundations of Development Policy (14.74x) would be another edX course that’s been getting great reviews recently: Introduction to R Programming

R Code for Home Work (Week 1)

R Code for Home Work (Week 2)

I hope this helps!

Object Oriented Programing with Python – Particle Diffusion Simulation

I’m a newbie to the programming world. I first started programming in Python in May this year, a month after I started this blog, so I still haven’t learnt enough to contribute to economics as is the stated goal of this blog. But I know I’ll get there in a year or less.

This blog was also meant to document my learning. In May, I would have called myself Newb v0.0. Today, 3 months later, I’d like to call myself Newb v0.3 and the goal is to be at least Expert v1.0 by January 2016.

With the help of Rice University’s awesome classes on Python programming I created a cool simulation of particles diffusing into space, using the concept of Classes, which I learnt just yesterday!

Click to check out the code !

Screenshot from 2015-07-23 11:49:00

Screenshot from 2015-07-23 11:49:10

Screenshot from 2015-07-23 11:49:39

Statistics: The Sexiest Job of the Decade

Anyone who’s got a formal education in economics knows who Hal Varian is. He’s most popularly known for his book Intermediate Economics. He’s also the Chief Economist at Google. He is known to have famously stated more or less, that statisticians and data analysts would be the sexiest jobs of the next decade.

That has come true, to a great extent, and we’ll be seeing more.

Great places to learn more about data science and statistical learning:
1] Statistical Learning (Stanford)
2] The Analytics Edge (MIT)

In a paper called ‘Big Data: New Tricks for Econometrics‘, Varian goes on to say that:

In fact, my standard advice to graduate students these days is “go to the computer science department and take a class in machine learning.” There have been very fruitful collaborations between computer scientists and statisticians in the last decade or so, and I expect collaborations between computer scientists and econometricians will also be productive in the future.

See Also: Slides on Machine Learning and Econometrics