Scraping the Daily India Covid-19 Tracker for CSV Data

May 10, 2020May 26, 2020 Anirudh Econometrics, Technical Code Snippets, Coding, COVID-19, Data Mining, R, Scraping

This is a very short post that will be very useful to help you quickly set up your COVID-19 datasets. I’m sharing code at the end of this post that scrapes through all CSV datasets made available by COVID19-India API.

We have exposed all the crowdsourced patient details, travel history (published by authorities) and statewise trends in this live API : https://t.co/tNyhpPYTJD A shout-out to all Data Analysts, Planners and Enthusiasts to use this data for helping the containment efforts. @_mekin

— covid19indiaorg (@covid19indiaorg) March 22, 2020

Copy paste this standalone script into your R environment and get going!

There are 15+ CSV files on the India COVID-19 API website. raw_data3 is actually a live dataset and more can be expected in the days to come, which is why a script that automates the data sourcing comes in handy. Snapshot of the file names and the data dimensions as of today, 100 days since the first case was recorded in the state of Kerala —

My own analysis of the data and predictions are work-in-progress, going into a Github repo. Execute the code below and get started analyzing the data and fighting COVID-19!

	rm(list = ls())

	# Load relevant libraries -----------------------------------------------------
	library(stringr)
	library(data.table)

	# =============================================================================
	# COVID 19-India API: A volunteer-driven, crowdsourced database
	# for COVID-19 stats & patient tracing in India
	# =============================================================================

	url <- "https://api.covid19india.org/csv/"

	# List out all CSV files to source --------------------------------------------

	html <- paste(readLines(url), collapse="\n")
	pattern <- "https://api.covid19india.org/csv/latest/.*csv"
	matched <- unlist(str_match_all(string = html, pattern = pattern))

	# Downloading the Data --------------------------------------------------------

	covid_datasets <- lapply(as.list(matched), fread)

	# Naming the data objects appropriately ---------------------------------------


	exclude_chars <- "https://api.covid19india.org/csv/latest/"

	dataset_names <- substr(x = matched,
	start = 1 + nchar(exclude_chars),
	stop = nchar(matched)- nchar(".csv"))

	# assigning variable names
	for(i in seq_along(dataset_names)){
	assign(dataset_names[i], covid_datasets[[i]])
	}

view raw source_patient_data_india.R hosted with ❤ by GitHub

Endogenously Detecting Structural Breaks in a Time Series: Implementation in R

November 8, 2016November 8, 2016 Anirudh Technical Coding, Econometrics, Economics, Machine Learning, R, strucchange, Structural Breaks, Time Series

The most conventional approach to determine structural breaks in longitudinal data seems to be the Chow Test.

From Wikipedia,

The Chow test, proposed by econometrician Gregory Chow in 1960, is a test of whether the coefficients in two linear regressions on different data sets are equal. In econometrics, it is most commonly used in time series analysis to test for the presence of a structural break at a period which can be assumed to be known a priori (for instance, a major historical event such as a war). In program evaluation, the Chow test is often used to determine whether the independent variables have different impacts on different subgroups of the population.

As shown in the figure below, regressions on the 2 sub-intervals seem to have greater explanatory power than a single regression over the data.

For the data above, determining the sub-intervals is an easy task. However, things may not look that simple in reality. Conducting a Chow test for structural breaks leaves the data scientist at the mercy of his subjective gaze in choosing a null hypothesis for a break point in the data.

Instead of choosing the breakpoints in an exogenous manner, what if the data itself could learn where these breakpoints lie? Such an endogenous technique is what Bai and Perron came up with in a seminal paper published in 1998 that could detect multiple structural breaks in longitudinal data. A later paper in 2003 dealt with the testing for breaks empirically, using a dynamic programming algorithm based on the Bellman principle.

I will discuss a quick implementation of this technique in R.

Brief Outline:

Assuming you have a ts object (I don’t know whether this works with zoo, but it should) in R, called ts. Then implement the following:

	# assuming you have a 'ts' object in R

	# 1. install package 'strucchange'
	# 2. Then write down this code:

	library(strucchange)

	# store the breakdates
	bp_ts <- breakpoints(ts ~ 1)

	# this will give you the break dates and their confidence intervals
	summary(bp_ts)

	# store the confidence intervals
	ci_ts <- confint(bp_ts)

	## to plot the breakpoints with confidence intervals
	plot(ts)
	lines(bp_ts)
	lines(ci_ts)

view raw strucchange_usage.R hosted with ❤ by GitHub

An illustration

I started with data on India’s rice crop productivity between 1950 (around Independence from British Colonial rule) and 2008. Here’s how it looks:

You can download the excel and CSV files here and here respectively.

Here’s the way to go using R:

	library(xlsx)
	library(forecast)
	library(tseries)
	library(strucchange)

	## load the data from a CSV or Excel file. This example is done with an Excel sheet.
	prod_df <- read.xlsx(file = 'agricultural_productivity.xls', sheetIndex = 'Sheet1', rowIndex = 8:65, colIndex = 2, header = FALSE)
	colnames(prod_df) <- c('Rice')
	## store rice data as time series objects
	rice <- ts(prod_df$Rice, start=c(1951, 1), end=c(2008, 1), frequency=1)

	# store the breakpoints
	bp.rice <- breakpoints(rice ~ 1)
	summary(bp.rice)

	## the BIC chooses 5 breakpoints; plot the graph with breakdates and their confidence intervals
	plot(bp.rice)
	plot(rice)
	lines(bp.rice)

	## confidence intervals
	ci.rice <- confint(bp.rice)
	ci.rice
	lines(ci.rice)

view raw rice_strucchange.R hosted with ❤ by GitHub

Voila, this is what you get:

The dotted vertical lines indicated the break dates; the horizontal red lines indicate their confidence intervals.

This is a quick and dirty implementation. For a more detailed take, check out the documentation on the R package called strucchange.

scikit-learn Linear Regression Example

July 19, 2016July 19, 2016 Anirudh Technical Algorithms, Code Snippets, Coding, example, IPython, Jupyter, Linear Regression, Machine Learning, Python, scikit learn, sklearn

Here’s a quick example case for implementing one of the simplest of learning algorithms in any machine learning toolbox – Linear Regression. You can download the IPython / Jupyter notebook here so as to play around with the code and try things out yourself.

I’m doing a series of posts on scikit-learn. Its documentation is vast, so unless you’re willing to search for a needle in a haystack, you’re better off NOT jumping into the documentation right away. Instead, knowing chunks of code that do the job might help.

Sorry, something went wrong. Reload?

Sorry, we cannot display this file.

Sorry, this file is invalid so it cannot be displayed.

view raw linear_regression.ipynb hosted with ❤ by GitHub

Data Manipulation in R with dplyr – Part 3

December 22, 2015December 22, 2015 Anirudh Technical Code Snippets, Coding, Dagwood Sandwich, Data Manipulation, Data Science, dplyr, R

This happens to be my 50th blog post – and my blog is 8 months old.

🙂

This post is the third and last post in in a series of posts (Part 1 – Part 2) on data manipulation with dlpyr. Note that the objects in the code may have been defined in earlier posts and the code in this post is in continuation with code from the earlier posts.

Although datasets can be manipulated in sophisticated ways by linking the 5 verbs of dplyr in conjunction, linking verbs together can be a bit verbose.

Creating multiple objects, especially when working on a large dataset can slow you down in your analysis. Chaining functions directly together into one line of code is difficult to read. This is sometimes called the Dagwood sandwich problem: you have too much filling (too many long arguments) between your slices of bread (parentheses). Functions and arguments get further and further apart.

The %>% operator allows you to extract the first argument of a function from the arguments list and put it in front of it, thus solving the Dagwood sandwich problem.

	# %>% OPERATOR ----------------------------------------------------------------------

	# with %>% operator
	hflights %>%
	mutate(diff = TaxiOut - TaxiIn) %>%
	filter(!is.na(diff)) %>%
	summarise(avg = mean(diff))

	# without %>% operator
	# arguments get further and further apart
	summarize(filter(mutate(hflights, diff = TaxiOut - TaxiIn),!is.na(diff)),
	avg = mean(diff))


	# with %>% operator
	d <- hflights %>%
	select(Dest, UniqueCarrier, Distance, ActualElapsedTime) %>%
	mutate(RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60)

	# without %>% operator
	d <- mutate(select(hflights, Dest, UniqueCarrier, Distance, ActualElapsedTime),
	RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60)

	# Filter and summarise d
	d %>%
	filter(!is.na(mph), mph < 70) %>%
	summarise(n_less = n(), n_dest = n_distinct(Dest),
	min_dist = min(Distance), max_dist = max(Distance))

	# Let's define preferable flights as flights that are 150% faster than driving,
	# i.e. that travel 105 mph or greater in real time. Also, assume that cancelled or
	# diverted flights are less preferable than driving.


	# ADVANCED PIPING EXERCISES
	# Use one single piped call to print a summary with the following variables:

	# n_non - the number of non-preferable flights in hflights,
	# p_non - the percentage of non-preferable flights in hflights,
	# n_dest - the number of destinations that non-preferable flights traveled to,
	# min_dist - the minimum distance that non-preferable flights traveled,
	# max_dist - the maximum distance that non-preferable flights traveled

	hflights %>%
	mutate(RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60) %>%
	filter(mph < 105 \| Cancelled == 1 \| Diverted == 1) %>%
	summarise(n_non = n(), p_non = 100*n_non/nrow(hflights), n_dest = n_distinct(Dest),
	min_dist = min(Distance), max_dist = max(Distance))

	# Use summarise() to create a summary of hflights with a single variable, n,
	# that counts the number of overnight flights. These flights have an arrival
	# time that is earlier than their departure time. Only include flights that have
	# no NA values for both DepTime and ArrTime in your count.

	hflights %>%
	mutate(overnight = (ArrTime < DepTime)) %>%
	filter(overnight == TRUE) %>%
	summarise(n = n())

view raw pipingOperator.r hosted with ❤ by GitHub

group_by()

group_by() defines groups within a data set. Its influence becomes clear when calling summarise() on a grouped dataset. Summarizing statistics are calculated for the different groups separately.

	# group_by() -------------------------------------------------------------------------

	# Generate a per-carrier summary of hflights with the following variables: n_flights,
	# the number of flights flown by the carrier; n_canc, the number of cancelled flights;
	# p_canc, the percentage of cancelled flights; avg_delay, the average arrival delay of
	# flights whose delay does not equal NA. Next, order the carriers in the summary from
	# low to high by their average arrival delay. Use percentage of flights cancelled to
	# break any ties. Which airline scores best based on these statistics?

	hflights %>%
	group_by(UniqueCarrier) %>%
	summarise(n_flights = n(), n_canc = sum(Cancelled), p_canc = 100*n_canc/n_flights,
	avg_delay = mean(ArrDelay, na.rm = TRUE)) %>% arrange(avg_delay)

	# Generate a per-day-of-week summary of hflights with the variable avg_taxi,
	# the average total taxiing time. Pipe this summary into an arrange() call such
	# that the day with the highest avg_taxi comes first.

	hflights %>%
	group_by(DayOfWeek) %>%
	summarize(avg_taxi = mean(TaxiIn + TaxiOut, na.rm = TRUE)) %>%
	arrange(desc(avg_taxi))

view raw group_by.R hosted with ❤ by GitHub

Combine group_by with mutate

group_by() can also be combined with mutate(). When you mutate grouped data, mutate() will calculate the new variables independently for each group. This is particularly useful when mutate() uses the rank() function, that calculates within group rankings. rank() takes a group of values and calculates the rank of each value within the group, e.g.

rank(c(21, 22, 24, 23))

has output

[1] 1 2 4 3

As with arrange(), rank() ranks values from the largest to the smallest and this behaviour can be reversed with the desc() function.

	# Combine group_by with mutate-----

	# First, discard flights whose arrival delay equals NA. Next, create a by-carrier
	# summary with a single variable: p_delay, the proportion of flights which are
	# delayed at arrival. Next, create a new variable rank in the summary which is a
	# rank according to p_delay. Finally, arrange the observations by this new rank
	hflights %>%
	filter(!is.na(ArrDelay)) %>%
	group_by(UniqueCarrier) %>%
	summarise(p_delay = sum(ArrDelay >0)/n()) %>%
	mutate(rank = rank(p_delay)) %>%
	arrange(rank)

	# n a similar fashion, keep flights that are delayed (ArrDelay > 0 and not NA).
	# Next, create a by-carrier summary with a single variable: avg, the average delay
	# of the delayed flights. Again add a new variable rank to the summary according to
	# avg. Finally, arrange by this rank variable.
	hflights %>%
	filter(!is.na(ArrDelay), ArrDelay > 0) %>%
	group_by(UniqueCarrier) %>%
	summarise(avg = mean(ArrDelay)) %>%
	mutate(rank = rank(avg)) %>%
	arrange(rank)

	# Advanced group_by exercises-------------------------------------------------------

	# Which plane (by tail number) flew out of Houston the most times? How many times?
	# Name the column with this frequency n. Assign the result to adv1. To answer this
	# question precisely, you will have to filter() as a final step to end up with only
	# a single observation in adv1.
	# Which plane (by tail number) flew out of Houston the most times? How many times? adv1
	adv1 <- hflights %>%
	group_by(TailNum) %>%
	summarise(n = n()) %>%
	filter(n == max(n))

	# How many airplanes only flew to one destination from Houston? adv2
	# How many airplanes only flew to one destination from Houston?
	# Save the resulting dataset in adv2, that contains only a single column,
	# named nplanes and a single row.
	adv2 <- hflights %>%
	group_by(TailNum) %>%
	summarise(n_dest = n_distinct(Dest)) %>%
	filter(n_dest == 1) %>%
	summarise(nplanes = n())

	# Find the most visited destination for each carrier and save your solution to adv3.
	# Your solution should contain four columns:
	# UniqueCarrier and Dest,
	# n, how often a carrier visited a particular destination,
	# rank, how each destination ranks per carrier. rank should be 1 for every row,
	# as you want to find the most visited destination for each carrier.

	adv3 <- hflights %>%
	group_by(UniqueCarrier, Dest) %>%
	summarise(n = n()) %>%
	mutate(rank = rank(desc(n))) %>%
	filter(rank == 1)

	# Find the carrier that travels to each destination the most: adv4
	# For each destination, find the carrier that travels to that destination the most.
	# Store the result in adv4. Again, your solution should contain 4 columns:
	# Dest, UniqueCarrier, n and rank.

	adv4 <- hflights %>%
	group_by(Dest, UniqueCarrier) %>%
	summarise(n = n()) %>%
	mutate(rank = rank(desc(n))) %>%
	filter(rank == 1)

view raw group_by_miscellaneous.R hosted with ❤ by GitHub

Statistical Learning – 2016

December 13, 2015 Anirudh Non Technical Coding, Machine Learning, MOOC, Programming, R, Stanford University, Statistical Learning, Statistics

On January 12, 2016, Stanford University professors Trevor Hastie and Rob Tibshirani will offer the 3rd iteration of Statistical Learning, a MOOC which first began in January 2014, and has become quite a popular course among data scientists. It is a great place to learn statistical learning (machine learning) methods using the R programming language. For a quick course on R, check this out – Introduction to R Programming

Slides and videos for Statistical Learning MOOC by Hastie and Tibshirani available separately here. Slides and video tutorials related to this book by Abass Al Sharif can be downloaded here.

The course covers the following book which is available for free as a PDF copy.

Logistics and Effort:

Rough Outline of Schedule (based on last year’s course offering):

Week 1: Introduction and Overview of Statistical Learning (Chapters 1-2)
Week 2: Linear Regression (Chapter 3)
Week 3: Classification (Chapter 4)
Week 4: Resampling Methods (Chapter 5)
Week 5: Linear Model Selection and Regularization (Chapter 6)
Week 6: Moving Beyond Linearity (Chapter 7)
Week 7: Tree-based Methods (Chapter 8)
Week 8: Support Vector Machines (Chapter 9)
Week 9: Unsupervised Learning (Chapter 10)

Prerequisites: First courses in statistics, linear algebra, and computing.

Sherlock and the Beast – HackerRank

December 12, 2015December 12, 2015 Anirudh Technical Coding, Decent Number, HackerRank, Math

I found myself stuck on this problem recently. I must confess, I lost a couple of hours trying to get to figure the logic for this one. Here’s the problem:

I’ve written 2 functions to solve this problem. The first one I used for smaller N, say N < 30 and the second one for N > 30. The second function is elegant, and it relies on the mathematical property that if a number N is not divisible by 3, it could either leave a remainder 1 or 2.

If it leaves a remainder 2, then subtracting 5 once would make the number divisible by 3. If it leaves a remainder 1, then subtracting 5 twice would make the number divisible by 3.

We subtract 5 from N iteratively and attempt to divide N into 2 parts, one divisible by 3 and the other divisible by 5. We want the part that is divisible by 3 to be the larger part, so that the associated Decent Number is the largest possible. This explanation might seem obtuse, but if you get pen down on paper, you’ll understand what I mean.

Solution

Solutions to Machine Learning Programming Assignments

November 24, 2015July 25, 2016 Anirudh Technical Andrew Ng, Code Snippets, Coding, Machine Learning, Octave, Python, Solutions

This post contains links to a bunch of code that I have written to complete Andrew Ng’s famous machine learning course which includes several interesting machine learning problems that needed to be solved using the Octave / Matlab programming language. I’m not sure I’d ever be programming in Octave after this course, but learning Octave just so that I could complete this course seemed worth the time and effort. I would usually work on the programming assignments on Sundays and spend several hours coding in Octave, telling myself that I would later replicate the exercises in Python.

If you’ve taken this course and found some of the assignments hard to complete, I think it might not hurt to go check online on how a particular function was implemented. If you end up copying the entire code, it’s probably your loss in the long run. But then John Maynard Keynes once said, ‘In the long run we are all dead‘. Yeah, and we wonder why people call Economics the dismal science!

Most people disregard Coursera’s feeble attempt at reigning in plagiarism by creating an Honor Code, precisely because this so-called code-of-conduct can be easily circumvented. I don’t mind posting solutions to a course’s programming assignments because GitHub is full to the brim with such content. Plus, it’s always good to read others’ code even if you implemented a function correctly. It helps understand the different ways of tackling a given programming problem.

ex1
ex2
ex3
ex4
ex5
ex6
ex7
ex8

Enjoy!

MITx 6.00.2x Introduction to Computational Thinking and Data Science (Fall 2015)

October 21, 2015October 21, 2015 Anirudh Non Technical Coding, Data Science, Data Visualization, edX, MIT, MOOC, Python

MIT’s Fall 2015 iteration of 6.00.2x starts today. After an enriching learning experience with 6.00.1x, I have great expectations from this course. As the course website mildly puts it, 6.00.2x is an introduction to using computation to understand real-world phenomena. MIT OpenCourseware (OCW) mirroring the material covered in 6.00.1x and 6.00.2x can be found here.

The course follows this book by John Guttag (who happens to be one of the instructors for this course). However, purchasing the book isn’t a necessity for this course.

One thing I loved about 6.00.1x was its dedicated Facebook group, which gave a community / classroom-peergroup feel to the course. 6.00.2x also has a Facebook group. Here’s a sneak peak:

The syllabus and schedule for this course is shown below. The course is spread out over 2 months which includes 7 weeks of lectures.

The prerequisites for this course are pretty much covered in this set of tutorial videos that have been created by one of the TAs for 6.00.1x. If you’ve not taken 6.00.1x in the past, you can go through these videos (running time < 1hr) to judge whether or not to go ahead with 6.00.2x.

So much for the update. Got work to do! 🙂

Python to the Rescue

October 16, 2015October 16, 2015 Anirudh Non Technical Coding, Development Economics, Economics, MIT, Python

Another journal-like entry

Programming as a profession is only moderately interesting. It can be a good job, but you could make about the same money and be happier running a fast food joint. You’re much better off using code as your secret weapon in another profession.

People who can code in the world of technology companies are a dime a dozen and get no respect. People who can code in biology, medicine, government, sociology, physics, history, and mathematics are respected and can do amazing things to advance those disciplines.

– Advice from an Old Programmer

I was reading a paper today, written by MIT’s Esther Duflo, part of a homework assignment on a MOOC on development policy (Foundations of Development Policy: Advanced Development Economics) offered by Duflo and Abhijit Banerjee. So I opened the paper and started copying important lines from the PDF to a text editor to make notes. I could copy the text, but when I pasted it onto a text editor, it turned out to be gibberish (you can try it too!).

For instance, instead of pasting

Between 1973 and 1978 the Indonesian Government constructed over 61,000 primary schools throughout the county

I got:

Ehwzhhq 4<:6 dqg 4<:;/ wkh Lqgrqhvldq Jryhuqphqw frqv wuxfwhg ryhu 94/333 sulpdu| vfkrrov wkurxjkrxw wkh frxqwu|

It was a good thing the cipher used for this text wasn’t too complicated. After some perusal, I found that ‘B’ became ‘E’, ‘e’ became ‘h’, ‘t’ became ‘w’ and so on. So I copied the entire content of the PDF to a text file and named the encrypted file estherDuflo.txt. I noticed that the encryption had been implemented only on the first 1475 lines. The remaining was plain English.

So I wrote a Python script to decrypt the gibberish, rather than simply typing out my notes. It took 20 minutes writing the code and 8 ms to execute (of course!). I didn’t want to spend a lot of time ensuring a thorough decryption, so the result wasn’t perfect, but then I’m going to make do. I named the decrypted file estherDufloDecrypted.txt.

Sample from the Encrypted File

	5U LL?} @?_ w@MLh @h!i\| L?ti^ i?Uit Lu 5U LL
	L?t\|h U\|L? ? W?_L?it@G ,_i?Ui uhL4 @? N? t @* L*U)
	, Tih4i?\|
	,t\| ih # L
	W
	Devwudfw
	Ehwzhhq 4<:6 dqg 4<:;/ wkh Lqgrqhvldq Jryhuqphqw frqvwuxfwhg ryhu 94/333 sulpdu\|
	vfkrrov wkurxjkrxw wkh frxqwu\|1 Wklv lv rqh ri wkh odujhvw vfkrro frqvwuxfwlrq surjudpv rq
	uhfrug1 L hydoxdwh wkh hhfw ri wklv surjudp rq hgxfdwlrq dqg zdjhv e\| frpelqlqj glhuhqfhv
	dfurvv uhjlrqv lq wkh qxpehu ri vfkrrov frqvwuxfwhg zlwk glhuhqfhv dfurvv frkruwv lqgxfhg
	e\| wkh wlplqj ri wkh surjudp1 Wkh hvwlpdwhv vxjjhvw wkdw wkh frqvwuxfwlrq ri sulpdu\| vfkrrov
	ohg wr dq lqfuhdvh lq hgxfdwlrq dqg hduqlqjv1 Fkloguhq djhg 5 wr 9 lq 4<:7 uhfhlyhg 3145 wr
	314< pruh \|hduv ri hgxfdwlrq iru hdfk vfkrro frqvwuxfwhg shu 4/333 fkloguhq lq wkhlu uhjlrq
	ri eluwk1 Xvlqj wkh yduldwlrqv lq vfkrrolqj jhqhudwhg e\| wklv srolf\| dv lqvwuxphqwdo yduldeohv
	iru wkh lpsdfw ri hgxfdwlrq rq zdjhv jhqhudwhv hvwlpdwhv ri hfrqrplf uhwxuqv wr hgxfdwlrq
	udqjlqj iurp 91; shufhqw wr 4319 shufhqw1 +MHO L5/ M64/ R48/ R55,
	Wkh txhvwlrq ri zkhwkhu lqyhvwphqw lq lqiudvwuxfwxuh lqfuhdvhv kxpdq fdslwdo dqg uhgxfhv
	sryhuw\| kdv orqj ehhq d frqfhuq wr ghyhorsphqw hfrqrplvwv dqg srolf\|pdnhuv1 Iru h{dpsoh/
	dydlodelolw\| ri vfkrrolqj lqiudvwuxfwxuh kdv ehhq vkrzq wr eh srvlwlyho\| fruuhodwhg zlwk frpsohwhg
	vfkrrolqj ru hquroophqw e\| Qdqf\| Elugvdoo +4<;8, lq xuedq Eud}lo/ Ghqqlv GhWud\| +4<;;, dqg Ohh

view raw estherDuflo.txt hosted with ❤ by GitHub

My Code

	from string import *

	# create decipher dictionary
	l = letters[:26]
	decipher = "".join([l[(i+3)%26] for i in range(len(l))])
	decipher = dict(zip(decipher,l))

	# open and read encrypted text
	filename = 'estherDuflo.txt'
	f = open(filename, 'rw')
	lines = f.readlines()
	lines = [l[:-1] for l in lines]
	# use first 1475 lines only
	newlines = lines[:1475]

	# apply decryption on those 1475 lines
	decipheredLines = []
	for line in newlines:
	x = line.lower()
	s = []
	for letter in x:
	if letter in letters:
	s.append(decipher[letter])
	else:
	s.append(letter)
	s.append('\n')
	decipheredLines.append(''.join(s))

	# write deciphered text to new text file
	decipheredFile = 'estherDufloDeciphered.txt'
	df = open(decipheredFile, 'w')
	for line in decipheredLines:
	df.write("%s" % line)

	# close both text files
	f.close()
	df.close()

view raw estherDuflo.py hosted with ❤ by GitHub

Sample from the Decrypted File

	5r ii?} @?_ t@jie @e!f\| i?qf^ f?rfq ir 5r ii
	i?q\|e r\|i? ? t?_i?fq@d ,_f?rf rei4 @? k? q @* i*r)
	, qfe4f?\|
	,q\| fe # i
	t
	abstract
	between 4<:6 and 4<:;/ the indonesian government constructed over 94/333 primar\|
	schools throughout the countr\|1 this is one of the largest school construction programs on
	record1 i evaluate the eect of this program on education and wages b\| combining dierences
	across regions in the number of schools constructed with dierences across cohorts induced
	b\| the timing of the program1 the estimates suggest that the construction of primar\| schools
	led to an increase in education and earnings1 children aged 5 to 9 in 4<:7 received 3145 to
	314< more \|ears of education for each school constructed per 4/333 children in their region
	of birth1 using the variations in schooling generated b\| this polic\| as instrumental variables
	for the impact of education on wages generates estimates of economic returns to education
	ranging from 91; percent to 4319 percent1 +jel i5/ j64/ o48/ o55,
	the question of whether investment in infrastructure increases human capital and reduces
	povert\| has long been a concern to development economists and polic\|makers1 for e{ample/
	availabilit\| of schooling infrastructure has been shown to be positivel\| correlated with completed
	schooling or enrollment b\| nanc\| birdsall +4<;8, in urban bra}il/ dennis detra\| +4<;;, and lee

view raw estherDufloDecrypted.txt hosted with ❤ by GitHub

Magic 5-gon Ring — Project Euler (Problem 68)

September 13, 2015September 13, 2015 Anirudh Non Technical Algorithms, Code Snippets, Coding, Math, Problem, Project Euler, Puzzles, Python

Yet another exciting math problem that requires an algorithmic approach to arrive at a quick solution! There is a pen-paper approach to it too, but this post assumes we’re more interested in discussing the programming angle.

First, the problem:

Working clockwise, and starting from the group of three with the numerically lowest external node (4,3,2 in this example), each solution can be described uniquely. For example, the above solution can be described by the set: 4,3,2; 6,2,1; 5,1,3.

It is possible to complete the ring with four different totals: 9, 10, 11, and 12. There are eight solutions in total.

Total Solution Set:
9 4,2,3; 5,3,1; 6,1,2
9 4,3,2; 6,2,1; 5,1,3
10 2,3,5; 4,5,1; 6,1,3
10 2,5,3; 6,3,1; 4,1,5
11 1,4,6; 3,6,2; 5,2,4
11 1,6,4; 5,4,2; 3,2,6
12 1,5,6; 2,6,4; 3,4,5
12 1,6,5; 3,5,4; 2,4,6
By concatenating each group it is possible to form 9-digit strings; the maximum string for a 3-gon ring is 432621513.

Problem

Using the numbers 1 to 10, and depending on arrangements, it is possible to form 16- and 17-digit strings. What is the maximum 16-digit string for a “magic” 5-gon ring?

Algorithm

In attempting this problem, I choose to label the 5 inner nodes as i, j, k, l, and m.
α, β, γ, δ, and θ being the corresponding outer nodes.

Let x be the sum total of each triplet line, i.e.,

x = α + i + j = β + j + k = γ + k + l = δ + l + m = θ + m + i

First Observation:
For the string to be 16-digits, 10 has to be in the outer ring, as each number in the inner ring is included in the string twice. Next, we fill the inner ring in an iterative manner.

Second Observation:
There 9 numbers to choose from for the inner ring — 1, 2, 3, 4, 5, 6, 7, 8 and 9.
5 have to be chosen. This can be done in 9C5 = 126 ways.
According to circular permutation, if there are n distinct numbers to be arranged in a circle, this can be done in (n-1)! ways, where (n-1)! = (n-1).(n-2).(n-3)…3.2.1. So 5 distinct numbers can be arranged in 4! permutations, i.e., in 24 ways around a circle, or pentagonal ring, to be more precise.
So in all, this problem can be solved in 126×24 = 3024 iterations.

Third Observation:
For every possible permutation of an inner-ring arrangement, there can be one or more values of x (triplet line-sum) that serve as a possible contenders for a “magic” string whose triplets add up to the same number, x. To ensure this, we only need that the values of α through θ of the outer ring are distinct, different from the inner ring, with the greatest of these equal to 10.
Depending on the relative positioning of the numbers in the inner ring, one can narrow the range of x-values one might have to check for each permutation. To zero-down on such a range, let’s look at an example. Shown in the figure below is a randomly chosen permutation of number in the inner ring – 7, 2, 3, 4 and 5, in that order.

So 10, 9, 8, 6 and 1 must fill the outer circle. It’s easy to notice that the 5, 7 pair is the greatest adjacent pair. So whatever x is, it has to be at least 5 + 7 + 1 = 13 (1 being the smallest number of the outer ring). Likewise, 2, 3 is the smallest adjacent pair, so whatever x is, it can’t be any more than 2 + 3+ 10 = 15 (10 being the largest number of the outer ring). This leaves us with a narrow range of x-values to check – 13, 14 and 15.

Next, we arrange the 5 triplets in clock-wise direction starting with the triplet with the smallest number in the outer ring to form a candidate string. This exercise when done for each of the 3024 permutations will shortlist a range of candidates, of which, the maximum is chosen.

That’s all there is to the problem!

Here’s the Python Code. It executes in about a tenth of a second!

	from itertools import permutations
	from itertools import combinations
	# array of candidate solutions empty at the beginning
	record = []

	# choose 5 numbers for inner cells between 1 and 9; there are 9C5 combinations
	# the problem ask for a 16-digit number, so 10 is not to be included in inner cells
	cells = range(1,10)
	inner_cells = [map(int,comb) for comb in combinations(cells,5)]

	# code to calculate min and max couple in an array
	def minCouple(array):
	answer = array[0]+array[-1]
	for i in xrange(len(array)-1):
	coupleSum = array[i] + array[i+1]
	if coupleSum < answer:
	answer = coupleSum
	return answer

	def maxCouple(array):
	answer = 0
	for i in xrange(len(array)-1):
	if i==0:
	coupleSum = array[0]+ array[-1]
	if coupleSum > answer:
	answer = coupleSum
	else:
	coupleSum = array[i]+ array[i+1]
	if coupleSum > answer:
	answer = coupleSum
	return answer

	# Algorithm
	for array in inner_cells:
	pivot = array[0]
	perm_array = array[1:]
	perms = [map(int,perm) for perm in permutations(perm_array,4)]
	for perm in perms:
	checkArray = perm
	checkArray.insert(0,pivot)
	outerRing = [el for el in range(1,11) if el not in checkArray]
	xMax = minCouple(checkArray) + max(outerRing)
	xMin = maxCouple(checkArray) + min(outerRing)
	if xMax >= xMin:
	for x in xrange(xMin, xMax+1):
	i = checkArray[0]
	j = checkArray[1]
	k = checkArray[2]
	l = checkArray[3]
	m = checkArray[4]

	alpha = x-i-j
	beta = x-j-k
	gamma = x-k-l
	delta = x-l-m
	theta = x-m-i

	outerCalculated = [alpha, beta, gamma, delta, theta]

	if sorted(outerCalculated) == sorted(outerRing):
	a = [alpha, i, j]
	b = [beta, j, k]
	c = [gamma, k, l]
	d = [delta, l, m]
	e = [theta, m, i]
	min_val = min(alpha, beta, gamma, delta, theta)
	if alpha == min_val:
	append = a+b+c+d+e
	elif beta == min_val:
	append = b+c+d+e+a
	elif gamma == min_val:
	append = c+d+e+a+b
	elif delta == min_val:
	append = d+e+a+b+c
	elif theta == min_val:
	append = e+a+b+c+d
	l = [str(i) for i in append]
	s = ''.join(l)
	integer_list = int(s)
	record.append(integer_list)
	print max(record)

view raw euler68.py hosted with ❤ by GitHub

Ans: 6531031914842725

	5U LL?} @?_ w@MLh @h!i\| L?ti^ i?Uit Lu 5U LL
	L?t\|h U\|L? ? W?_L?it@G ,_i?Ui uhL4 @? N? t @* L*U)
	, Tih4i?\|
	,t\| ih # L
	W
	Devwudfw
	Ehwzhhq 4<:6 dqg 4<:;/ wkh Lqgrqhvldq Jryhuqphqw frqvwuxfwhg ryhu 94/333 sulpdu\|
	vfkrrov wkurxjkrxw wkh frxqwu\|1 Wklv lv rqh ri wkh odujhvw vfkrro frqvwuxfwlrq surjudpv rq
	uhfrug1 L hydoxdwh wkh hhfw ri wklv surjudp rq hgxfdwlrq dqg zdjhv e\| frpelqlqj glhuhqfhv
	dfurvv uhjlrqv lq wkh qxpehu ri vfkrrov frqvwuxfwhg zlwk glhuhqfhv dfurvv frkruwv lqgxfhg
	e\| wkh wlplqj ri wkh surjudp1 Wkh hvwlpdwhv vxjjhvw wkdw wkh frqvwuxfwlrq ri sulpdu\| vfkrrov
	ohg wr dq lqfuhdvh lq hgxfdwlrq dqg hduqlqjv1 Fkloguhq djhg 5 wr 9 lq 4<:7 uhfhlyhg 3145 wr
	314< pruh \|hduv ri hgxfdwlrq iru hdfk vfkrro frqvwuxfwhg shu 4/333 fkloguhq lq wkhlu uhjlrq
	ri eluwk1 Xvlqj wkh yduldwlrqv lq vfkrrolqj jhqhudwhg e\| wklv srolf\| dv lqvwuxphqwdo yduldeohv
	iru wkh lpsdfw ri hgxfdwlrq rq zdjhv jhqhudwhv hvwlpdwhv ri hfrqrplf uhwxuqv wr hgxfdwlrq
	udqjlqj iurp 91; shufhqw wr 4319 shufhqw1 +MHO L5/ M64/ R48/ R55,
	Wkh txhvwlrq ri zkhwkhu lqyhvwphqw lq lqiudvwuxfwxuh lqfuhdvhv kxpdq fdslwdo dqg uhgxfhv
	sryhuw\| kdv orqj ehhq d frqfhuq wr ghyhorsphqw hfrqrplvwv dqg srolf\|pdnhuv1 Iru h{dpsoh/
	dydlodelolw\| ri vfkrrolqj lqiudvwuxfwxuh kdv ehhq vkrzq wr eh srvlwlyho\| fruuhodwhg zlwk frpsohwhg
	vfkrrolqj ru hquroophqw e\| Qdqf\| Elugvdoo +4<;8, lq xuedq Eud}lo/ Ghqqlv GhWud\| +4<;;, dqg Ohh

	5r ii?} @?_ t@jie @e!f\| i?qf^ f?rfq ir 5r ii
	i?q\|e r\|i? ? t?_i?fq@d ,_f?rf rei4 @? k? q @* i*r)
	, qfe4f?\|
	,q\| fe # i
	t
	abstract
	between 4<:6 and 4<:;/ the indonesian government constructed over 94/333 primar\|
	schools throughout the countr\|1 this is one of the largest school construction programs on
	record1 i evaluate the eect of this program on education and wages b\| combining dierences
	across regions in the number of schools constructed with dierences across cohorts induced
	b\| the timing of the program1 the estimates suggest that the construction of primar\| schools
	led to an increase in education and earnings1 children aged 5 to 9 in 4<:7 received 3145 to
	314< more \|ears of education for each school constructed per 4/333 children in their region
	of birth1 using the variations in schooling generated b\| this polic\| as instrumental variables
	for the impact of education on wages generates estimates of economic returns to education
	ranging from 91; percent to 4319 percent1 +jel i5/ j64/ o48/ o55,
	the question of whether investment in infrastructure increases human capital and reduces
	povert\| has long been a concern to development economists and polic\|makers1 for e{ample/
	availabilit\| of schooling infrastructure has been shown to be positivel\| correlated with completed
	schooling or enrollment b\| nanc\| birdsall +4<;8, in urban bra}il/ dennis detra\| +4<;;, and lee

Discovering Python & R

— my journey as a worker bee in quant finance

Coding

Scraping the Daily India Covid-19 Tracker for CSV Data

Endogenously Detecting Structural Breaks in a Time Series: Implementation in R

scikit-learn Linear Regression Example

Data Manipulation in R with dplyr – Part 3

Statistical Learning – 2016

Sherlock and the Beast – HackerRank

Solutions to Machine Learning Programming Assignments

MITx 6.00.2x Introduction to Computational Thinking and Data Science (Fall 2015)

Python to the Rescue

Magic 5-gon Ring — Project Euler (Problem 68)

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this:

Share this: