Installing TensorFlow on Windows is Easy!

December 30, 2017 Anirudh Technical how-to, Python, TensorFlow, Windows

I recently started using Python on Windows; until now I had only worked with Python on Ubuntu.

I am sure I am late in realizing this, but installing TensorFlow was just so easy!

If you tried installing TensorFlow on Windows when support was first introduced and gave up back then, try again. I'd recommend using Anaconda Navigator, from which you first open a terminal (figure below). You may notice that I already have a tensorflow environment set up, since I am writing this post after installation.

Once you have the terminal open, create a conda environment named tensorflow by invoking the following command with your Python version:

C:> conda create -n tensorflow python=3.6

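This only creates the environment. Activate it (activate tensorflow, or conda activate tensorflow on newer conda versions) and install the package inside it with pip install tensorflow. That's all! You should now have TensorFlow ready to use.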

For more details, you could always go here. Otherwise, the screenshot below gives a sense of what it takes.

Linear / Logistic Regression in R: Dealing With Unknown Factor Levels in Test Data

October 8, 2017 Anirudh Technical Code Snippets, GitHub, Linear Regression, Logistic Regression, Machine Learning, R

Let's say you have data containing a categorical variable with 50 levels. When you divide the data into train and test sets, chances are that not all 50 levels appear in your training set.

This often happens when you split the data according to the distribution of the outcome variable. The explanatory categorical variable is then not necessarily distributed the same way in the train and test sets, so much so that certain levels can be missing from the training set altogether. The more levels a categorical variable has, the harder it is for that variable to be similarly represented on both sides of the split.
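To see how easily this happens, here is a small sketch on simulated data. The column name g, the 50 made-up levels and the 70/30 split are my own assumptions, not the data set used in this post. It splits the rows at random and lists the levels that show up in the test set only:

```r
# Simulated data: one categorical variable with up to 50 levels (hypothetical)
set.seed(42)
dat <- data.frame(
  g = factor(sample(sprintf("lvl%02d", 1:50), 200, replace = TRUE))
)

# Random 70/30 split; droplevels() keeps only the levels present in each part
idx   <- sample(nrow(dat), 140)
train <- droplevels(dat[idx, , drop = FALSE])
test  <- droplevels(dat[-idx, , drop = FALSE])

# Levels that a model fitted on 'train' would never have seen
missing_in_train <- setdiff(levels(test$g), levels(train$g))
print(missing_in_train)
```

Running a check like this right after splitting tells you up front whether predictions on the test set are likely to fail.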

Take for instance this example data set (train.csv + test.csv), which contains a categorical variable var_b with 349 unique levels. Our train data, on which the model is built, has 334 of these levels, and hence 15 levels are unknown to the trained model. If you try making predictions on the test set with this model in R, it throws an error:
factor var_b has new levels 16060, 17300, 17980, 19060, 21420, 21820, 25220, 29340, 30300, 33260, 34100, 38340, 39660, 44300, 45460
If you've used R to fit generalized linear models such as linear, logit or probit models, then chances are you've come across this problem, especially when validating your trained model on test data.
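The error is easy to reproduce on a toy example. The sketch below uses made-up data (the names y and g are hypothetical, not columns from train.csv or test.csv) and catches the error so that the script keeps running:

```r
# Toy data: level "d" appears in the test set but never in the training set
set.seed(1)
train <- data.frame(
  y = rnorm(6),
  g = factor(c("a", "a", "b", "b", "c", "c"))
)
test <- data.frame(g = factor(c("a", "b", "d")))

fit <- lm(y ~ g, data = train)

# predict() stops because the model has no coefficient for level "d";
# tryCatch() turns the error into its message so we can inspect it
res <- tryCatch(
  predict(fit, newdata = test),
  error = function(e) conditionMessage(e)
)
print(res)
```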

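The workaround to this problem is a function, remove_missing_levels, that I found here, written by pat-s. It works only on lm, glm and glmmPQL objects. You need the magrittr library installed; the glmmPQL branch additionally calls map, unmatrix and str_split, which come from the purrr, gdata and stringr packages.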

	remove_missing_levels <- function(fit, test_data) {
	  library(magrittr)

	  # https://stackoverflow.com/a/39495480/4185785

	  # drop empty factor levels in test data
	  test_data %>%
	    droplevels() %>%
	    as.data.frame() -> test_data

	  # 'fit' object structure of 'lm' and 'glmmPQL' is different so we need to
	  # account for it
	  if (any(class(fit) == "glmmPQL")) {
	    # Obtain factor predictors in the model and their levels
	    # (strip digits and any as.factor( ) wrapper from the names)
	    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
	                     names(unlist(fit$contrasts))))
	    # do nothing if no factors are present
	    if (length(factors) == 0) {
	      return(test_data)
	    }

	    # this branch needs the purrr, gdata and stringr packages
	    purrr::map(fit$contrasts, function(x) names(gdata::unmatrix(x))) %>%
	      unlist() -> factor_levels
	    factor_levels %>% stringr::str_split(":", simplify = TRUE) %>%
	      extract(, 1) -> factor_levels

	    model_factors <- as.data.frame(cbind(factors, factor_levels))
	  } else {
	    # Obtain factor predictors in the model and their levels
	    # (strip digits and any as.factor( ) wrapper from the names)
	    factors <- (gsub("[-^0-9]|as.factor|\\(|\\)", "",
	                     names(unlist(fit$xlevels))))
	    # do nothing if no factors are present
	    if (length(factors) == 0) {
	      return(test_data)
	    }

	    factor_levels <- unname(unlist(fit$xlevels))
	    model_factors <- as.data.frame(cbind(factors, factor_levels))
	  }

	  # Select column names in test data that are factor predictors in
	  # trained model

	  predictors <- names(test_data[names(test_data) %in% factors])

	  # For each factor predictor in your data, if the level is not in the model,
	  # set the value to NA

	  # seq_along() is safe even when no predictor columns are found
	  for (i in seq_along(predictors)) {
	    found <- test_data[, predictors[i]] %in% model_factors[
	      model_factors$factors == predictors[i], ]$factor_levels
	    if (any(!found)) {
	      # track which variable
	      var <- predictors[i]
	      # set to NA
	      test_data[!found, predictors[i]] <- NA
	      # drop empty factor levels in test data
	      test_data %>%
	        droplevels() -> test_data
	      # issue warning to console
	      message(sprintf(paste0("Setting missing levels in '%s', only present",
	                             " in test data but missing in train data,",
	                             " to 'NA'."),
	                      var))
	    }
	  }
	  return(test_data)
	}

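For lm and glm fits, the core idea of the function can be shown in a few lines. This is only my own stripped-down sketch on made-up data (the names y and g are hypothetical), not a replacement for the function: look up the levels the model saw in fit$xlevels, set everything else to NA, and drop the unused levels before predicting.

```r
# Toy data: level "d" appears in the test set but never in the training set
set.seed(1)
train <- data.frame(
  y = rnorm(6),
  g = factor(c("a", "a", "b", "b", "c", "c"))
)
test <- data.frame(g = factor(c("a", "b", "d")))

fit <- lm(y ~ g, data = train)

# Levels of 'g' that were present when the model was fitted
known <- fit$xlevels[["g"]]

# Rows with an unseen level become NA; then drop the now-unused level
test$g[!(test$g %in% known)] <- NA
test$g <- droplevels(test$g)

# predict() no longer fails; rows with NA predictors get an NA prediction
preds <- predict(fit, newdata = test)
print(preds)
```

Note that this does not produce a prediction for the unseen level; it simply returns NA for those rows so the remaining rows can still be scored.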

Once you've sourced the above function in R, you can use your trained model to make predictions on the test set; rows with unseen levels get NA predictions instead of stopping predict with an error. The code below demonstrates this for the data set shared above. You can find this code in one of my GitHub repos and try it out yourself.

	library(data.table)

	train <- fread('train.csv'); test <- fread('test.csv')

	# consolidate the 2 data sets after creating a variable indicating train / test
	train$flag <- 0; test$flag <- 1
	dat <- rbind(train,test)

	# change outcome, var_b and var_e into factor var
	dat$outcome <- factor(dat$outcome)
	dat$var_b <- factor(dat$var_b)
	dat$var_e <- factor(dat$var_e)

	# check the levels of var_b and var_e in this consolidated, train and test data sets
	length(levels(dat$var_b)); length(unique(train$var_b)); length(unique(test$var_b))

	# get back the train and test data
	train <- subset(dat, flag == 0); test <- subset(dat, flag == 1)
	train$flag <- NULL; test$flag <- NULL

	# Build Logit Model using train data and make predictions
	logitModel <- glm(outcome ~ ., data = train, family = 'binomial')
	preds_train <- predict(logitModel, type = 'response')

	# Model Predictions on test data
	preds_test <- predict(logitModel, newdata = test, type = 'response')
	# running the above code gives us the following error:
	# factor var_b has new levels 16060, 17300, 17980, 19060, 21420, 21820,
	# 25220, 29340, 30300, 33260, 34100, 38340, 39660, 44300, 45460

	# Workaround:
	source('remove_missing_levels.R')
	preds_test <- predict(logitModel,
	                      newdata = remove_missing_levels(fit = logitModel, test_data = test),
	                      type = 'response')


Quick Way of Installing all your old R libraries on a New Device

July 27, 2017 Anirudh Technical how-to, R

I recently bought a new laptop and began installing essential software all over again, including R of course! I wanted all the libraries that I had installed on my previous laptop. Instead of installing them one by one all over again, I did the following:

Step 1: Save a list of packages installed in your old computing device (from your old device).

installed <- as.data.frame(installed.packages())
write.csv(installed, 'installed_previously.csv')

This saves information on installed packages in a csv file named installed_previously.csv. Now copy or e-mail this file to your new device and access it from your working directory in R.

Step 2: From your old list, work out which libraries are not already installed on a fresh R installation (from your new device).

installedPreviously <- read.csv('installed_previously.csv', stringsAsFactors = FALSE)
baseR <- as.data.frame(installed.packages(), stringsAsFactors = FALSE)
toInstall <- setdiff(installedPreviously$Package, baseR$Package)

We now have the names of the libraries that were installed on your previous computer but do not ship with a fresh R installation. Go ahead and install them.

Step 3: Install this list of libraries.

install.packages(toInstall)

That's it. Save yourself the trouble of installing packages one by one all over again.
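The comparison in Step 2 boils down to a setdiff on package names. Here is a minimal sketch with a made-up old list (notARealPackage123 is a deliberately fake name so that the example never installs anything):

```r
# Hypothetical package names saved from the old device
old_pkgs <- c("stats", "utils", "notARealPackage123")

# Package names available on the current device
new_pkgs <- rownames(installed.packages())

# Names present on the old device but not on this one
to_install <- setdiff(old_pkgs, new_pkgs)
print(to_install)

# With a real list you would now run:
# install.packages(to_install)
```

Comparing names only, rather than whole rows of installed.packages(), also avoids spurious differences caused by version numbers or library paths.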

Key Insights on Sberbank Home Price Predicting Kaggle Competition Coming Soon…

June 29, 2017 (updated June 30, 2017) Anirudh Hackathons Data Science, Kaggle, Machine Learning, Python

This post is more about data science and Kaggle than about R or Python. I am currently taking part in my second Kaggle competition, Sberbank Russian Housing Market: Can you predict realty price fluctuations in Russia's volatile economy?

I’ve been stuck for about a week at the 52nd percentile among 3400+ Kagglers taking part in the competition. I’ve been told that Kaggle Kernels and discussion boards are helpful when you’re stuck or if you need to learn some practical data science that can’t be gleaned from books or tutorials.

One such discussion thread looks like this:

This person going by the pseudonym Schoolpal is currently killing it on the leaderboard and I’m eagerly looking forward to this person’s code once the competition ends in less than 24 hours. If you’re interested too, follow this discussion here.

Cheers!

Update:

Schoolpal, mentioned above, finally came in second and shared their approach here.

Discovering Python & R

— my journey as a worker bee in quant finance

Year: 2017
