Linear Algebra behind the lm() function in R

This post comes out of the blue, nearly 2 years since my last one. I realize I’ve been lazy, so here’s hoping I move from an inertia of rest to that of motion, implying, regular and (hopefully) relevant posts. I also chanced upon some wisdom while scrolling through my Twitter feed:

This blog post in particular was meant to be a reminder to myself and other R users that the much used lm() function in R (for fitting linear models) can be replaced with some handy matrix operations to obtain regression coefficients, their standard errors and other goodness-of-fit stats printed out when summary() is called on an lm object.

Linear regression can be formulated mathematically as follows:
\mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{\epsilon} ,
\mathbf{\epsilon} \sim N(0, \sigma^2 \mathbf{I})

\mathbf{y} is the \mathbf{n}\times \mathbf{1} outcome variable and \mathbf{X} is the \mathbf{n}\times \mathbf{(\mathbf{k}+1)} data matrix of independent predictor variables (including a vector of ones corresponding to the intercept). The ordinary least squares (OLS) estimate for the vector of coefficients \mathbf{\beta} is:

\hat{\mathbf{\beta}} = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \mathbf{y}

The covariance matrix can be obtained with some handy matrix operations:
\textrm{Var}(\hat{\mathbf{\beta}}) = (\mathbf{X}^{\prime} \mathbf{X})^{-1} \mathbf{X}^{\prime} \;\sigma^2 \mathbf{I} \; \mathbf{X} (\mathbf{X}^{\prime} \mathbf{X})^{-1} = \sigma^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1}
given that \textrm{Var}(AX) = A \times \textrm{Var}X \times A^{\prime}; \textrm{Var}(\mathbf{y}) = \mathbf{\sigma^2}

The standard errors of the coefficients are basically \textrm{Diag}(\sqrt{\textrm{Var}(\hat{\mathbf{\beta}})}) = \textrm{Diag}(\sqrt{\sigma^2 (\mathbf{X}^{\prime} \mathbf{X})^{-1}}) and with these, one can compute the t-statistics and their corresponding p-values.

Lastly, the F-statistic and its corresponding p-value can be calculated after computing the two residual sum of squares (RSS) statistics:

  • \mathbf{RSS} – for the full model with all predictors
  • \mathbf{RSS_0} – for the partial model (\mathbf{y} = \mathbf{\mu} + \mathbf{\nu}; \mathbf{\mu} = \mathop{\mathbb{E}}[\mathbf{y}]; \mathbf{\nu} \sim N(0, \sigma_0^2 \mathbf{I}) ) with the outcome observed mean as estimated outcome

\mathbf{F} = \frac{(\mathbf{RSS_0}-\mathbf{RSS})/\mathbf{k}}{\mathbf{RSS}/(\mathbf{n}-\mathbf{k}-1)}

I wrote some R code to construct the output from summarizing lm objects, using all the math spewed thus far. The data used for this exercise is available in R, and comprises of standardized fertility measures and socio-economic indicators for each of 47 French-speaking provinces of Switzerland from 1888. Try it out and see for yourself the linear algebra behind linear regression.

### Linear Regression Using lm() ----------------------------------------
data("swiss")
dat <- swiss
linear_model <- lm(Fertility ~ ., data = dat)
summary(linear_model)
# Call:
# lm(formula = Fertility ~ ., data = dat)
#
# Residuals:
# Min 1Q Median 3Q Max
# -15.2743 -5.2617 0.5032 4.1198 15.3213
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
# Agriculture -0.17211 0.07030 -2.448 0.01873 *
# Examination -0.25801 0.25388 -1.016 0.31546
# Education -0.87094 0.18303 -4.758 2.43e-05 ***
# Catholic 0.10412 0.03526 2.953 0.00519 **
# Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 7.165 on 41 degrees of freedom
# Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
# F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
### Using Linear Algebra ------------------------------------------------
y <- matrix(dat$Fertility, nrow = nrow(dat))
X <- cbind(1, as.matrix(x = dat[,-1]))
colnames(X)[1] <- "(Intercept)"
# N x k matrix
N <- nrow(X)
k <- ncol(X) - 1 # number of predictor variables (ergo, excluding Intercept column)
# Estimated Regression Coefficients
beta_hat <- solve(t(X)%*%X)%*%(t(X)%*%y)
# Variance of outcome variable = Variance of residuals
sigma_sq <- residual_variance <- (N-k-1)^-1 * sum((y - X %*% beta_hat)^2)
residual_std_error <- sqrt(residual_variance)
# Variance and Std. Error of estimated coefficients of the linear model
var_betaHat <- sigma_sq * solve(t(X) %*% X)
coeff_std_errors <- sqrt(diag(var_betaHat))
# t values of estimates are ratio of estimated coefficients to std. errors
t_values <- beta_hat / coeff_std_errors
# p-values of t-statistics of estimated coefficeints
p_values_tstat <- 2 * pt(abs(t_values), N-k, lower.tail = FALSE)
# assigning R's significance codes to obtained p-values
signif_codes_match <- function(x){
ifelse(x <= 0.001,"***",
ifelse(x <= 0.01,"**",
ifelse(x < 0.05,"*",
ifelse(x < 0.1,"."," "))))
}
signif_codes <- sapply(p_values_tstat, signif_codes_match)
# R-squared and Adjusted R-squared (refer any econometrics / statistics textbook)
R_sq <- 1 - (N-k-1)*residual_variance / (N*mean((y - mean(y))^2))
R_sq_adj <- 1 - residual_variance / ((N/(N-1))*mean((y - mean(y))^2))
# Residual sum of squares (RSS) for the full model
RSS <- (N-k-1)*residual_variance
# RSS for the partial model with only intercept (equal to mean), ergo, TSS
RSS0 <- TSS <- sum((y - mean(y))^2)
# F statistic based on RSS for full and partial models
# k = degress of freedom of partial model
# N - k - 1 = degress of freedom of full model
F_stat <- ((RSS0 - RSS)/k) / (RSS/(N-k-1))
# p-values of the F statistic
p_value_F_stat <- pf(F_stat, df1 = k, df2 = N-k-1, lower.tail = FALSE)
# stitch the main results toghether
lm_results <- as.data.frame(cbind(beta_hat, coeff_std_errors,
t_values, p_values_tstat, signif_codes))
colnames(lm_results) <- c("Estimate","Std. Error","t value","Pr(>|t|)","")
### Print out results of all relevant calcualtions -----------------------
print(lm_results)
cat("Residual standard error: ",
round(residual_std_error, digits = 3),
" on ",N-k-1," degrees of freedom",
"\nMultiple R-squared: ",R_sq," Adjusted R-squared: ",R_sq_adj,
"\nF-statistic: ",F_stat, " on ",k-1," and ",N-k-1,
" DF, p-value: ", p_value_F_stat,"\n")
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 66.9151816789654 10.7060375853301 6.25022854119771 1.73336561301153e-07 ***
# Agriculture -0.172113970941457 0.0703039231786469 -2.44814177018405 0.0186186100433133 *
# Examination -0.258008239834722 0.253878200892098 -1.01626779663678 0.315320687313066
# Education -0.870940062939429 0.183028601571259 -4.75849159892283 2.3228265226988e-05 ***
# Catholic 0.104115330743766 0.035257852536169 2.95296858017545 0.00513556154915653 **
# Infant.Mortality 1.07704814069103 0.381719650858061 2.82156849475775 0.00726899472564356 **
# Residual standard error: 7.165 on 41 degrees of freedom
# Multiple R-squared: 0.706735 Adjusted R-squared: 0.670971
# F-statistic: 19.76106 on 4 and 41 DF, p-value: 5.593799e-10
view raw lm_linear_algebra.R hosted with ❤ by GitHub

Hope this was useful and worth your time!

Linear / Logistic Regression in R: Dealing With Unknown Factor Levels in Test Data

Let’s say you have data containing a categorical variable with 50 levels. When you divide the data into train and test sets, chances are you don’t have all 50 levels featuring in your training set.

This often happens when you divide the data set into train and test sets according to the distribution of the outcome variable. In doing so, chances are that our explanatory categorical variable might not be distributed exactly the same way in train and test sets – so much so that certain levels of this categorical variable are missing from the training set. The more levels there are to a categorical variable, it gets difficult for that variable to be similarly represented upon splitting the data.

Take for instance this example data set (train.csv + test.csv) which contains a categorical variable var_b that takes 349 unique levels. Our train data has 334 of these levels – on which the model is built – and hence 15 levels are excluded from our trained model. If you try making predictions on the test set with this model in R, it throws an error:
factor var_b has new levels 16060, 17300, 17980, 19060, 21420, 21820,
25220, 29340, 30300, 33260, 34100, 38340, 39660, 44300, 45460

If you’ve used R to model generalized linear class of models such as linear, logit or probit models, then chances are you’ve come across this problem – especially when you’re validating your trained model on test data.

The workaround to this problem is in the form of a function, remove_missing_levels  that I found here written by pat-s. You need magrittr library installed and it can only work on lm, glm and glmmPQL objects.

remove_missing_levels <- function(fit, test_data) {
library(magrittr)
# https://stackoverflow.com/a/39495480/4185785
# drop empty factor levels in test data
test_data %>%
droplevels() %>%
as.data.frame() -> test_data
# 'fit' object structure of 'lm' and 'glmmPQL' is different so we need to
# account for it
if (any(class(fit) == "glmmPQL")) {
# Obtain factor predictors in the model and their levels
factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
names(unlist(fit$contrasts))))
# do nothing if no factors are present
if (length(factors) == 0) {
return(test_data)
}
map(fit$contrasts, function(x) names(unmatrix(x))) %>%
unlist() -> factor_levels
factor_levels %>% str_split(":", simplify = TRUE) %>%
extract(, 1) -> factor_levels
model_factors <- as.data.frame(cbind(factors, factor_levels))
} else {
# Obtain factor predictors in the model and their levels
factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
names(unlist(fit$xlevels))))
# do nothing if no factors are present
if (length(factors) == 0) {
return(test_data)
}
factor_levels <- unname(unlist(fit$xlevels))
model_factors <- as.data.frame(cbind(factors, factor_levels))
}
# Select column names in test data that are factor predictors in
# trained model
predictors <- names(test_data[names(test_data) %in% factors])
# For each factor predictor in your data, if the level is not in the model,
# set the value to NA
for (i in 1:length(predictors)) {
found <- test_data[, predictors[i]] %in% model_factors[
model_factors$factors == predictors[i], ]$factor_levels
if (any(!found)) {
# track which variable
var <- predictors[i]
# set to NA
test_data[!found, predictors[i]] <- NA
# drop empty factor levels in test data
test_data %>%
droplevels() -> test_data
# issue warning to console
message(sprintf(paste0("Setting missing levels in '%s', only present",
" in test data but missing in train data,",
" to 'NA'."),
var))
}
}
return(test_data)
}

Once you’ve sourced the above function in R, you can seamlessly proceed with using your trained model to make predictions on the test set. The code below demonstrates this for the data set shared above. You can find these codes in one of my github repos and try it out yourself.

library(data.table)
train <- fread('train.csv'); test <- fread('test.csv')
# consolidate the 2 data sets after creating a variable indicating train / test
train$flag <- 0; test$flag <- 1
dat <- rbind(train,test)
# change outcome, var_b and var_e into factor var
dat$outcome <- factor(dat$outcome)
dat$var_b <- factor(dat$var_b)
dat$var_e <- factor(dat$var_e)
# check the levels of var_b and var_e in this consolidated, train and test data sets
length(levels(dat$var_b)); length(unique(train$var_b)); length(unique(test$var_b))
# get back the train and test data
train <- subset(dat, flag == 0); test <- subset(dat, flag == 1)
train$flag <- NULL; test$flag <- NULL
# Build Logit Model using train data and make predictions
logitModel <- glm(outcome ~ ., data = train, family = 'binomial')
preds_train <- predict(logitModel, type = 'response')
# Model Predictions on test data
preds_test <- predict(logitModel, newdata = test, type = 'response')
# running the above code gives us the following error:
# factor var_b has new levels 16060, 17300, 17980, 19060, 21420, 21820,
# 25220, 29340, 30300, 33260, 34100, 38340, 39660, 44300, 45460
# Workaround:
source('remove_missing_levels.R')
preds_test <- predict(logitModel,
newdata = remove_missing_levels(fit = logitModel, test_data = test),
type = 'response')
view raw factor_new_levels.R hosted with ❤ by GitHub

scikit-learn Linear Regression Example

Here’s a quick example case for implementing one of the simplest of learning algorithms in any machine learning toolbox – Linear Regression. You can download the IPython / Jupyter notebook here so as to play around with the code and try things out yourself.

I’m doing a series of posts on scikit-learn. Its documentation is vast, so unless you’re willing to search for a needle in a haystack, you’re better off NOT jumping into the documentation right away. Instead, knowing chunks of code that do the job might help.

Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.