Upgrading R / Installing R-3.2.0 on Ubuntu

Till recently, I was using R-3.1.1 on Windows OS. Then on April 16, 2015 (10 days ago), they released R-3.2.0. Upgrading it on Windows was easy peasy, not like the headache Ubuntu gave me.

I recently got a Dell Vostro 14 3000 series laptop with Ubuntu 12.04 installed. I haven’t yet upgraded to Ubuntu 14.04 because the graphics drivers for this computer aren’t available for that version. Besides, I’m not much of a gamer. If I were, I wouldn’t care for Ubuntu!

Anyway, I tried installing by typing the following on Terminal:

 sudo apt-get update  
 sudo apt-get install r-base r-base-dev  

R did get installed, but not the latest version. A much older version R-2.14.1. I later found out after quite a lot of time spent on StackExchange, that I had to choose a CRAN mirror that was geographically close to my computer, which would then act as a “software source” for the latest version of R. Now that explained why the above sudo commands weren’t getting me the desired version of software. It was because the the Ubuntu / Canonical software repositories only had an older R version. Also, the distribution line had to match the codename of my Ubuntu version (12.04 LTS).

 codename=$(lsb_release -c -s)  
 echo "deb http://ftp.iitm.ac.in/cran/bin/linux/ubuntu $codename/" | sudo tee -a /etc/apt/sources.list > /dev/null  

Note that instead of http://ftp.iitm.ac.in/cran one must replace it with the geographically closest CRAN mirror. Also, the Ubuntu archives on CRAN are signed with the key of Michael Rutter <marutter@gmail> with key ID E084DAB9. So we type in the following:

 sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9  
 sudo add-apt-repository ppa:marutter/rdev

Followed by what we would normally have done:

 sudo apt-get update  
 sudo apt-get upgrade  
 sudo apt-get install r-base r-base-dev  

This did the job for me, and I had R-3.2.0 installed successfully on my Ubuntu system. Compare this to Windows, where all you have to do is type in 3 lines (in R, and not Shell):


And to think I left Windows for Linux! I am a Linux newb, and God only knows why I wanted to try out Linux, but on giving it some thought, I think I know why

source: https://xkcd.com/456/ and http://xkcd.com/149/


Visualizing Macroeconomic Data using Choropleths in R

Choropleths are thematic maps shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map, such as population density or per-capita-income.

example choropleth

This post is about creating quick choropleth maps in R, with macroeconomic data across geographies.

As a sample exercise, I decided to get data on what percentage of their aggregate disbursements, do states in India spend on development expenditure. I got the data from the Reserve Bank of India’s website. I had to clean the data a little for easy handling in R. Here’s the cleaned data.

I used the choroplethr package designed by Ari Lamstein and Brian P Johnson to animate the data on the map of India. Here’s my code followed by output maps.

## load the requisite libraries into R
indianregions <- get_admin1_regions("india")
## gets dataframe of 2 columns with name of country ("india") throughout column 1
## and name of regions in 2nd column
## counts the number of regions under country "india"
setwd("C:/Anirudh/Coding/R/Practice/Practice Iteration 2")
df_dev_indicators <- read.xlsx("statewise_development_indicators.xls", sheetIndex = 1, colIndex = 2:5, rowIndex = 2:31, header = FALSE)
## reads excel data into an R dataframe
df_dev_indicators_2012 <- df_dev_indicators[c(1,2)]
df_dev_indicators_2013 <- df_dev_indicators[c(1,3)]
df_dev_indicators_2014 <- df_dev_indicators[c(1,4)]
## create 3 separate dataframes from the parent dataframe so as to have 2 columns,
## column 1 for region and column 2 for column 2 for value metric
names(df_dev_indicators_2012) <- c("region","value")
names(df_dev_indicators_2013) <- c("region","value")
names(df_dev_indicators_2014) <- c("region","value")
## assigning column names [required as per choroplethr function]
admin1_choropleth("india", df_dev_indicators_2012, title = "% Expenditure on Development in 2012", legend = "", buckets = 9, zoom = NULL)
## prints the choropleth map for 2012 indicators
southern_states <- c("state of karnataka","state of andhra pradesh", "state of kerala", "state of tamil nadu", "state of goa")
## stores regions that are to be printed as a bucket map
admin1_choropleth("india", df_dev_indicators_2012, title = "% Expenditure on Development in Southern States in 2012", legend = "", buckets = 9, zoom = southern_states)
## zooms into the buckets specified earlier
admin1_choropleth("india", df_dev_indicators_2012, title = "% Expenditure on Development in 2012", legend = "", buckets = 1, zoom = NULL)
admin1_choropleth("india", df_dev_indicators_2013, title = "% Expenditure on Development in 2013", legend = "", buckets = 1, zoom = NULL)
admin1_choropleth("india", df_dev_indicators_2014, title = "% Expenditure on Development in 2014", legend = "", buckets = 1, zoom = NULL)
view raw choroplethr.R hosted with ❤ by GitHub

…and as expected, the lines of code above print out the desired map

Expenditure on Development in Southern States (2012)

In the examples above I set the buckets attribute equal to 9. That set the data in discrete scales. Had I set buckets = 1 instead, we would have got a continuous scale of data.

Expenditure on Development (2012)_continuous

The same for the last 2 fiscal years:

Development Expenditures in the Last 2 Years

For the US, there are amazing packages for county level and ZIP code level detail of data visualization.

Here’s more on the choroplethr package for R and creating your own maps.

Getting Started

I have been searching for good MOOCs to get me started with R and Python programming languages. I’ve already begun the Johns Hopkins University Data Science Specialization on Coursera. It consists of 9 courses (including Data Scientist’s Toolbox, R programming, Getting and Cleaning Data, Exploratory Data Analysis, Reproducible Research, Statistical Inference, Regression Models, Practical Machine Learning and Developing Data Products), ending with a 7-week Capstone Project that I’m MOST excited about. I want to get there fast.

The Capstone would consist of :
  • Building a predictive data model for analyzing large textual data sets
  • Cleaning real-world data and perform complex regressions
  • Creating visualizations to communicate data analyses
  • Building a final data product in collaboration with SwiftKey, award-winning developer of leading keyboard apps for smartphones

I started with the R programming course where I found the programming assignments to be moderately difficult. They were good practice and also time-consuming for me since I haven’t yet gotten used to the R syntax, which is supposedly unintuitive. Anyway, I completed the course with distinction (90+ marks) scoring 95 on 100, losing 5 because I hadn’t familiarized myself with Git / GitHub. I did this course for a verified certificate, which cost me $29, and looks like this:

Coursera rprog 2015

I won’t be paying for any of the remaining courses though, but still will get a certificate of accomplishment for each course I pass. I have alredy begun with Getting and Cleaning Data and Data Scientist’s Toolbox.

I checked today, and it seems Andrew Ng’s Machine Learning course has gone open to all and is self-paced. A lot of people have gone on to participate in Kaggle competitions with what they learnt in his course, so I’d like to experience it — even though it’s taught with Octave / MATLAB. My very short term goal is to start participating in these competitions ASAP.

Kaggle Competitions

I will be learning the basics of Git this week and along with that, about reading from MySQL, HDF5, the web and APIs. I intend to start reading Trevor Hastie’s highly recommended book, Introduction to Statistical Learning.

ISL Cover 2


Meanwhile, I need to get started with Git and GitHub too, and I found a very useful blog by Kevin Markham and his short concise videos are great introductory material.

Incidentally, I was in a dilemma whether to start with Hastie’s material or Andrew Ng’s course first. This is what Kevin had to say

Hastie or Ng

The only reason I have reservations against Andrew Ng’s course is that its instruction isn’t in R or Python. Also, CTO and co-founder of Kaggle, Ben Hamner mentions here how useful R and Python are vis-à-vis Matlab / Octave.

Ben Hamner on Python R Matlab v2

Hello World!

Hello World

Hi all!

This website would be a most unusual way to blog about programming languages, that too coming from someone who hasn’t done much coding. In the next few minutes, I offer an introduction. It’s divided into 2 parts.

(i) introducing myself
(ii) an introduction to WHY I created this blog

Intro (i)

I am an electrical engineer who took to finance after graduating from college — doing what I’d like to think was preparing client pitches that bankers would use to wrap up multi-million dollar deals!

Just kidding. All I was doing was waiting for the last day of the month for the salary figure to pop up as a message in my phone’s inbox, i.e., watching my bank balance go up every month. It was in a moment of epiphany that I realized that I had better quit before I got used to being that way.

I then spent some time working as a social media analyst for a revolutionary political outfit — around the same time when the capital city of India was going to the polls for the Assembly elections. Politics sparked my curiosity for what was coming next – Economics!

I fell in love immediately, which found me studying economics here, at a research institute funded by the RBI, India’s equivalent of the Fed. I braved a semester, managing a face-saving GPA, for it had been 3 years since I had left academics, and I was moving to something unrelated to what I had been doing in the past, so the transition couldn’t have been smooth, I knew that.

Nevertheless, when I was taking my end of term exams that semester, it was after 2 weeks of hitting the gym. But life has its ways of throwing lemons at us from time to time. I’m now trying to squeeze the juice out of them for the proverbial lemonade. Anyway, I had to cut short my attempts at acquiring a six pack when it started to pain in my pelvic region, and my right leg had gone numb. Through the pain I somehow managed to appear for the end terms. When I was home after my exams, the pain gradually got worse and rose — like a crescendo!

Intro (ii)

Turns out I had what is commonly known as a slipped disc. I had herniations in L4-L5 and L5-S1 discs of my spine, with a 100% prolapse in the latter.

It’s been very painful. I can’t sit for more than 5 minutes without getting muscular spasms in my lumbar region, numbness in my feet and distressing nerve pain in my toes, buttocks and thighs that last for a couple of days each time I try sitting. Can’t stand longer than 10 minutes.

In summary, I’ve been bedridden for over 16 weeks now and have 9 months ahead of me before I can continue my education from where I had to leave it. Staying confined in a room for months on end, sick, is worse than being locked up in prison. It makes going to the doctor seem like a picnic!

I always wanted to get my hands dirty with programming, so I decided after much deliberation, that I would learn as much of Python and R as I can in the coming months. I’ll talk more about WHY, in some of my future posts (like this one), but for now it should suffice if I told you I want to keep myself from getting bored to death. For the months of April through December, this blog is meant to document my learning and struggles, insights and revelations.

What better way to start than this —

> print(“Hello World”)  # R
>>> print “Hello World”  # Python