Scraping the Daily India Covid-19 Tracker for CSV Data

This short post will help you quickly set up your COVID-19 datasets. The code I'm sharing at the end scrapes all the CSV datasets made available by the COVID19-India API.

Copy-paste this standalone script into your R environment and get going!

There are 15+ CSV files on the India COVID-19 API website. raw_data3 is a live dataset, and more files can be expected in the days to come, which is why a script that automates the data sourcing comes in handy. Here's a snapshot of the file names and data dimensions as of today, 100 days since the first case was recorded in the state of Kerala:


My own analysis of the data and predictions are a work in progress, going into a GitHub repo. Execute the code below and get started analyzing the data and fighting COVID-19!

rm(list = ls())
# Load relevant libraries -----------------------------------------------------
library(stringr)
library(data.table)
# =============================================================================
# COVID 19-India API: A volunteer-driven, crowdsourced database
# for COVID-19 stats & patient tracing in India
# =============================================================================
url <- "https://api.covid19india.org/csv/"
# List out all CSV files to source --------------------------------------------
html <- paste(readLines(url), collapse="\n")
pattern <- "https://api.covid19india.org/csv/latest/.*csv"
matched <- unlist(str_match_all(string = html, pattern = pattern))
# Downloading the Data --------------------------------------------------------
covid_datasets <- lapply(as.list(matched), fread)
# Naming the data objects appropriately ---------------------------------------
exclude_chars <- "https://api.covid19india.org/csv/latest/"
dataset_names <- substr(x = matched,
                        start = 1 + nchar(exclude_chars),
                        stop = nchar(matched) - nchar(".csv"))
# assign each dataset to a variable named after its source file
for (i in seq_along(dataset_names)) {
  assign(dataset_names[i], covid_datasets[[i]])
}

Endogenously Detecting Structural Breaks in a Time Series: Implementation in R

The most conventional approach to determining structural breaks in longitudinal data seems to be the Chow Test.

From Wikipedia,

The Chow test, proposed by econometrician Gregory Chow in 1960, is a test of whether the coefficients in two linear regressions on different data sets are equal. In econometrics, it is most commonly used in time series analysis to test for the presence of a structural break at a period which can be assumed to be known a priori (for instance, a major historical event such as a war). In program evaluation, the Chow test is often used to determine whether the independent variables have different impacts on different subgroups of the population.

As shown in the figure below, regressions on the 2 sub-intervals seem to have greater explanatory power than a single regression over the data.

440px-chow_test_structural_break

For the data above, determining the sub-intervals is an easy task. However, things may not look that simple in reality. Conducting a Chow test for structural breaks leaves the data scientist at the mercy of his subjective gaze in choosing a null hypothesis for a break point in the data.
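
To make that concrete: the Chow statistic compares the residual sum of squares (RSS) of a single pooled regression against the combined RSS of separate regressions on the two sub-intervals,

F = [(RSS_pooled − (RSS_1 + RSS_2)) / k] / [(RSS_1 + RSS_2) / (n − 2k)],

where k is the number of estimated coefficients and n the total number of observations. Here's a bare-bones sketch in Python/numpy — the rest of this post works in R, so treat this as illustrative pseudocode; the function name chow_test and its break_idx argument are my own, not from any package:

import numpy as np

def chow_test(y, x, break_idx, k=2):
    # Chow F-statistic for a single break after observation break_idx,
    # fitting y = a + b*x on the pooled sample and on each sub-interval
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    def rss(y_sub, x_sub):
        # residual sum of squares from an OLS fit of y_sub on [1, x_sub]
        X = np.column_stack([np.ones(len(x_sub)), x_sub])
        beta, _, _, _ = np.linalg.lstsq(X, y_sub, rcond=None)
        resid = y_sub - X.dot(beta)
        return resid.dot(resid)
    n = len(y)
    rss_pooled = rss(y, x)
    rss_split = rss(y[:break_idx], x[:break_idx]) + rss(y[break_idx:], x[break_idx:])
    # under the null of no break, this follows an F(k, n - 2k) distribution
    return ((rss_pooled - rss_split) / k) / (rss_split / (n - 2 * k))

Note the break_idx argument: the analyst has to supply the break point, which is precisely the subjectivity at issue.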

Instead of choosing the breakpoints in an exogenous manner, what if the data itself could learn where these breakpoints lie? Such an endogenous technique is what Bai and Perron came up with in a seminal paper published in 1998 that could detect multiple structural breaks in longitudinal data. A later paper in 2003 dealt with testing for breaks empirically, using a dynamic programming algorithm based on the Bellman principle.

I will discuss a quick implementation of this technique in R.

Brief Outline:

Assume you have a ts object in R called ts (I don't know whether this works with zoo objects, but it should). Then implement the following:

# assuming you have a 'ts' object in R
# 1. install package 'strucchange'
# 2. Then write down this code:
library(strucchange)
# store the breakdates
bp_ts <- breakpoints(ts ~ 1)
# this will give you the break dates and their confidence intervals
summary(bp_ts)
# store the confidence intervals
ci_ts <- confint(bp_ts)
## to plot the breakpoints with confidence intervals
plot(ts)
lines(bp_ts)
lines(ci_ts)

An illustration 

I started with data on India's rice crop productivity between 1951 (around the time of Independence from British colonial rule) and 2008. Here's how it looks:

rice_productivity

You can download the Excel and CSV files here and here respectively.

Here’s the way to go using R:

library(xlsx)
library(forecast)
library(tseries)
library(strucchange)
## load the data from a CSV or Excel file; this example uses an Excel sheet
prod_df <- read.xlsx(file = 'agricultural_productivity.xls', sheetName = 'Sheet1',
                     rowIndex = 8:65, colIndex = 2, header = FALSE)
colnames(prod_df) <- c('Rice')
## store rice data as time series objects
rice <- ts(prod_df$Rice, start=c(1951, 1), end=c(2008, 1), frequency=1)
# store the breakpoints
bp.rice <- breakpoints(rice ~ 1)
summary(bp.rice)
## the BIC chooses 5 breakpoints; plot the graph with breakdates and their confidence intervals
plot(bp.rice)
plot(rice)
lines(bp.rice)
## confidence intervals
ci.rice <- confint(bp.rice)
ci.rice
lines(ci.rice)

Voila, this is what you get:

02_rice_multiplebreaks

The dotted vertical lines indicate the break dates; the horizontal red lines indicate their confidence intervals.

This is a quick and dirty implementation. For a more detailed take, check out the documentation on the R package called strucchange.

scikit-learn Linear Regression Example

Here’s a quick example case for implementing one of the simplest of learning algorithms in any machine learning toolbox – Linear Regression. You can download the IPython / Jupyter notebook here so as to play around with the code and try things out yourself.

I’m doing a series of posts on scikit-learn. Its documentation is vast, so unless you’re willing to search for a needle in a haystack, you’re better off NOT jumping into the documentation right away. Instead, knowing chunks of code that do the job might help.

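If the notebook doesn't load for you, here's a minimal sketch along the same lines — ordinary least squares with scikit-learn's LinearRegression on synthetic data. To be clear, this is not the notebook itself; the toy data and variable names below are mine:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# synthetic data: y = 3x + 4 plus Gaussian noise
rng = np.random.RandomState(0)
X = 10 * rng.rand(100, 1)
y = 3 * X.ravel() + 4 + rng.randn(100)

# hold out a test set, fit on the rest
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression()
model.fit(X_train, y_train)

print(model.coef_[0])                # slope, should be close to 3
print(model.intercept_)              # intercept, should be close to 4
print(model.score(X_test, y_test))   # R^2 on the held-out data

The pattern — instantiate an estimator, fit(), then predict() or score() — is the same across practically every scikit-learn model, which is what makes the library easy to pick up once you've seen one example.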

Data Manipulation in R with dplyr – Part 3

This happens to be my 50th blog post – and my blog is 8 months old.

🙂

This post is the third and last in a series of posts (Part 1, Part 2) on data manipulation with dplyr. Note that the objects in the code may have been defined in earlier posts, and the code here continues from the code in those posts.

Although datasets can be manipulated in sophisticated ways by chaining the five verbs of dplyr, doing so can get verbose.

Creating multiple intermediate objects, especially when working on a large dataset, can slow down your analysis. Nesting function calls directly into one line of code is difficult to read. This is sometimes called the Dagwood sandwich problem: you have too much filling (too many long arguments) between your slices of bread (parentheses). Functions and arguments get further and further apart.

The %>% operator lets you pull the first argument of a function out of its argument list and place it in front of the call, thus solving the Dagwood sandwich problem.

# %>% OPERATOR ----------------------------------------------------------------------
# with %>% operator
hflights %>%
  mutate(diff = TaxiOut - TaxiIn) %>%
  filter(!is.na(diff)) %>%
  summarise(avg = mean(diff))
# without %>% operator
# arguments get further and further apart
summarize(filter(mutate(hflights, diff = TaxiOut - TaxiIn), !is.na(diff)),
          avg = mean(diff))
# with %>% operator
d <- hflights %>%
  select(Dest, UniqueCarrier, Distance, ActualElapsedTime) %>%
  mutate(RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60)
# without %>% operator
d <- mutate(select(hflights, Dest, UniqueCarrier, Distance, ActualElapsedTime),
            RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60)
# Filter and summarise d
d %>%
  filter(!is.na(mph), mph < 70) %>%
  summarise(n_less = n(), n_dest = n_distinct(Dest),
            min_dist = min(Distance), max_dist = max(Distance))
# Let's define preferable flights as flights that are 150% faster than driving,
# i.e. that travel 105 mph or greater in real time. Also, assume that cancelled or
# diverted flights are less preferable than driving.
# ADVANCED PIPING EXERCISES
# Use one single piped call to print a summary with the following variables:
# n_non - the number of non-preferable flights in hflights,
# p_non - the percentage of non-preferable flights in hflights,
# n_dest - the number of destinations that non-preferable flights traveled to,
# min_dist - the minimum distance that non-preferable flights traveled,
# max_dist - the maximum distance that non-preferable flights traveled
hflights %>%
  mutate(RealTime = ActualElapsedTime + 100, mph = Distance/RealTime*60) %>%
  filter(mph < 105 | Cancelled == 1 | Diverted == 1) %>%
  summarise(n_non = n(), p_non = 100*n_non/nrow(hflights), n_dest = n_distinct(Dest),
            min_dist = min(Distance), max_dist = max(Distance))
# Use summarise() to create a summary of hflights with a single variable, n,
# that counts the number of overnight flights. These flights have an arrival
# time that is earlier than their departure time. Only include flights that have
# no NA values for both DepTime and ArrTime in your count.
hflights %>%
  mutate(overnight = (ArrTime < DepTime)) %>%
  filter(overnight == TRUE) %>%
  summarise(n = n())

group_by()

group_by() defines groups within a dataset. Its influence becomes clear when calling summarise() on a grouped dataset: summary statistics are calculated for each group separately.

# group_by() -------------------------------------------------------------------------
# Generate a per-carrier summary of hflights with the following variables: n_flights,
# the number of flights flown by the carrier; n_canc, the number of cancelled flights;
# p_canc, the percentage of cancelled flights; avg_delay, the average arrival delay of
# flights whose delay does not equal NA. Next, order the carriers in the summary from
# low to high by their average arrival delay. Use percentage of flights cancelled to
# break any ties. Which airline scores best based on these statistics?
hflights %>%
  group_by(UniqueCarrier) %>%
  summarise(n_flights = n(), n_canc = sum(Cancelled), p_canc = 100*n_canc/n_flights,
            avg_delay = mean(ArrDelay, na.rm = TRUE)) %>%
  arrange(avg_delay)
# Generate a per-day-of-week summary of hflights with the variable avg_taxi,
# the average total taxiing time. Pipe this summary into an arrange() call such
# that the day with the highest avg_taxi comes first.
hflights %>%
  group_by(DayOfWeek) %>%
  summarize(avg_taxi = mean(TaxiIn + TaxiOut, na.rm = TRUE)) %>%
  arrange(desc(avg_taxi))

Combine group_by with mutate

group_by() can also be combined with mutate(). When you mutate grouped data, mutate() calculates the new variables independently for each group. This is particularly useful when mutate() uses the rank() function, which calculates within-group rankings. rank() takes a group of values and calculates the rank of each value within the group, e.g.

rank(c(21, 22, 24, 23))

has output

[1] 1 2 4 3

As with arrange(), rank() ranks values from the smallest to the largest; this behaviour can be reversed with the desc() function.

# Combine group_by with mutate -----
# First, discard flights whose arrival delay equals NA. Next, create a by-carrier
# summary with a single variable: p_delay, the proportion of flights which are
# delayed at arrival. Next, create a new variable rank in the summary which is a
# rank according to p_delay. Finally, arrange the observations by this new rank
hflights %>%
  filter(!is.na(ArrDelay)) %>%
  group_by(UniqueCarrier) %>%
  summarise(p_delay = sum(ArrDelay > 0)/n()) %>%
  mutate(rank = rank(p_delay)) %>%
  arrange(rank)
# In a similar fashion, keep flights that are delayed (ArrDelay > 0 and not NA).
# Next, create a by-carrier summary with a single variable: avg, the average delay
# of the delayed flights. Again add a new variable rank to the summary according to
# avg. Finally, arrange by this rank variable.
hflights %>%
  filter(!is.na(ArrDelay), ArrDelay > 0) %>%
  group_by(UniqueCarrier) %>%
  summarise(avg = mean(ArrDelay)) %>%
  mutate(rank = rank(avg)) %>%
  arrange(rank)
# Advanced group_by exercises-------------------------------------------------------
# Which plane (by tail number) flew out of Houston the most times? How many times?
# Name the column with this frequency n. Assign the result to adv1. To answer this
# question precisely, you will have to filter() as a final step to end up with only
# a single observation in adv1.
adv1 <- hflights %>%
  group_by(TailNum) %>%
  summarise(n = n()) %>%
  filter(n == max(n))
# How many airplanes only flew to one destination from Houston?
# Save the resulting dataset in adv2, which contains only a single column,
# named nplanes, and a single row.
adv2 <- hflights %>%
  group_by(TailNum) %>%
  summarise(n_dest = n_distinct(Dest)) %>%
  filter(n_dest == 1) %>%
  summarise(nplanes = n())
# Find the most visited destination for each carrier and save your solution to adv3.
# Your solution should contain four columns:
# UniqueCarrier and Dest,
# n, how often a carrier visited a particular destination,
# rank, how each destination ranks per carrier. rank should be 1 for every row,
# as you want to find the most visited destination for each carrier.
adv3 <- hflights %>%
  group_by(UniqueCarrier, Dest) %>%
  summarise(n = n()) %>%
  mutate(rank = rank(desc(n))) %>%
  filter(rank == 1)
# For each destination, find the carrier that travels to that destination the most.
# Store the result in adv4. Again, your solution should contain 4 columns:
# Dest, UniqueCarrier, n and rank.
adv4 <- hflights %>%
  group_by(Dest, UniqueCarrier) %>%
  summarise(n = n()) %>%
  mutate(rank = rank(desc(n))) %>%
  filter(rank == 1)


Statistical Learning – 2016

On January 12, 2016, Stanford University professors Trevor Hastie and Rob Tibshirani will offer the third iteration of Statistical Learning, a MOOC that first ran in January 2014 and has become quite a popular course among data scientists. It is a great place to learn statistical learning (machine learning) methods using the R programming language. For a quick course on R, check this out: Introduction to R Programming.

Slides and videos for the Statistical Learning MOOC by Hastie and Tibshirani are available separately here. Slides and video tutorials related to this book by Abass Al Sharif can be downloaded here.

The course covers the following book, which is available for free as a PDF.

Logistics and Effort:

statLearnEffort

Rough Outline of Schedule (based on last year’s course offering):

Week 1: Introduction and Overview of Statistical Learning (Chapters 1-2)
Week 2: Linear Regression (Chapter 3)
Week 3: Classification (Chapter 4)
Week 4: Resampling Methods (Chapter 5)
Week 5: Linear Model Selection and Regularization (Chapter 6)
Week 6: Moving Beyond Linearity (Chapter 7)
Week 7: Tree-based Methods (Chapter 8)
Week 8: Support Vector Machines (Chapter 9)
Week 9: Unsupervised Learning (Chapter 10)

Prerequisites: First courses in statistics, linear algebra, and computing.


Sherlock and the Beast – HackerRank

I found myself stuck on this problem recently. I must confess, I lost a couple of hours trying to figure out the logic for this one. Here's the problem:

sherlockAndTheBeast

I've written two functions to solve this problem: the first I used for smaller N (say, N < 30), and the second for N > 30. The second function is elegant, and it relies on the mathematical property that if a number N is not divisible by 3, it leaves a remainder of either 1 or 2.

If it leaves a remainder of 2, then subtracting 5 once makes the number divisible by 3. If it leaves a remainder of 1, then subtracting 5 twice makes the number divisible by 3.

We subtract 5 from N iteratively and attempt to divide N into two parts, one divisible by 3 and the other divisible by 5. We want the part that is divisible by 3 to be the larger part, so that the associated Decent Number is the largest possible. This explanation might seem obtuse, but if you put pen to paper, you'll see what I mean.

Solution

sherlockAndTheBeastSol
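
In case the screenshot doesn't render, here is a short reconstruction of the second approach in Python, assuming the standard problem statement (a Decent Number uses only the digits 3 and 5; the count of 5s must be divisible by 3 and the count of 3s divisible by 5). The function name decent_number is mine:

def decent_number(n):
    # split n = f + t, where f (the count of 5s) is divisible by 3 and
    # t (the count of 3s) is divisible by 5; maximise f so the 5s come first
    f = n
    while f >= 0 and f % 3 != 0:
        f -= 5
    if f < 0:
        return '-1'
    return '5' * f + '3' * (n - f)

for n in [1, 3, 5, 11]:
    print(decent_number(n))   # -1, 555, 33333, 55555533333

Because N mod 3 can only be 0, 1 or 2, the loop subtracts 5 at most twice before finding a valid split (or going negative, in which case no Decent Number exists).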

Solutions to Machine Learning Programming Assignments

This post contains links to a bunch of code that I have written to complete Andrew Ng's famous machine learning course, which includes several interesting machine learning problems that need to be solved using the Octave / Matlab programming language. I'm not sure I'd ever be programming in Octave after this course, but learning Octave just so that I could complete the course seemed worth the time and effort. I would usually work on the programming assignments on Sundays and spend several hours coding in Octave, telling myself that I would later replicate the exercises in Python.

If you've taken this course and found some of the assignments hard to complete, I think it might not hurt to check online how a particular function was implemented. If you end up copying the entire code, it's probably your loss in the long run. But then John Maynard Keynes once said, 'In the long run we are all dead'. Yeah, and we wonder why people call Economics the dismal science!

Most people disregard Coursera's feeble attempt at reining in plagiarism by creating an Honor Code, precisely because this so-called code of conduct can be easily circumvented. I don't mind posting solutions to a course's programming assignments because GitHub is full to the brim with such content. Plus, it's always good to read others' code even if you implemented a function correctly; it helps you understand the different ways of tackling a given programming problem.

ex1
ex2
ex3
ex4
ex5
ex6
ex7
ex8

Enjoy!


MITx 6.00.2x Introduction to Computational Thinking and Data Science (Fall 2015)

MIT's Fall 2015 iteration of 6.00.2x starts today. After an enriching learning experience with 6.00.1x, I have great expectations from this course. As the course website mildly puts it, 6.00.2x is an introduction to using computation to understand real-world phenomena. MIT OpenCourseWare (OCW) material mirroring what is covered in 6.00.1x and 6.00.2x can be found here.

The course follows this book by John Guttag (who happens to be one of the instructors). However, purchasing the book isn't a necessity for the course.

Introduction to Computation and Programming Using Python

One thing I loved about 6.00.1x was its dedicated Facebook group, which gave a community / classroom peer-group feel to the course. 6.00.2x also has a Facebook group. Here's a sneak peek:

descriptionUpdate

The syllabus and schedule for this course are shown below. The course is spread out over two months, including seven weeks of lectures.

MITx 6.00.2x Fall 2015 Course Calendar

The prerequisites for this course are pretty much covered in this set of tutorial videos that have been created by one of the TAs for 6.00.1x. If you’ve not taken 6.00.1x in the past, you can go through these videos (running time < 1hr) to judge whether or not to go ahead with 6.00.2x.

So much for the update. Got work to do! 🙂

Python to the Rescue

Another journal-like entry

Programming as a profession is only moderately interesting. It can be a good job, but you could make about the same money and be happier running a fast food joint. You’re much better off using code as your secret weapon in another profession.

People who can code in the world of technology companies are a dime a dozen and get no respect. People who can code in biology, medicine, government, sociology, physics, history, and mathematics are respected and can do amazing things to advance those disciplines.

Advice from an Old Programmer

I was reading a paper today, written by MIT's Esther Duflo, as part of a homework assignment for a MOOC on development policy (Foundations of Development Policy: Advanced Development Economics) offered by Duflo and Abhijit Banerjee. So I opened the paper and started copying important lines from the PDF into a text editor to make notes. I could copy the text, but when I pasted it into the editor, it turned out to be gibberish (you can try it too!).

For instance, instead of pasting

Between 1973 and 1978 the Indonesian Government constructed over 61,000 primary schools throughout the country

I got:

Ehwzhhq 4<:6 dqg 4<:;/ wkh Lqgrqhvldq Jryhuqphqw frqv wuxfwhg ryhu 94/333 sulpdu| vfkrrov wkurxjkrxw wkh frxqwu|

It was a good thing the cipher used for this text wasn't too complicated. After some perusal, I found that 'B' became 'E', 'e' became 'h', 't' became 'w' and so on. So I copied the entire content of the PDF to a text file and named the encrypted file estherDuflo.txt. I noticed that the encryption had been applied only to the first 1475 lines; the rest was plain English.

So I wrote a Python script to decrypt the gibberish, rather than simply typing out my notes. It took 20 minutes to write the code and 8 ms to execute (of course!). I didn't want to spend a lot of time ensuring a thorough decryption, so the result wasn't perfect, but I can make do. I named the decrypted file estherDufloDeciphered.txt.

Sample from the Encrypted File

5U LL*?} @?_ w@MLh @h!i| L?ti^ i?Uit Lu 5U LL*
L?t|h U|L? ? W?_L?it@G ,_i?Ui uhL4 @? N? t @* L*U)
, Tih4i?|
,t| ih # L
W
Devwudfw
Ehwzhhq 4<:6 dqg 4<:;/ wkh Lqgrqhvldq Jryhuqphqw frqvwuxfwhg ryhu 94/333 sulpdu|
vfkrrov wkurxjkrxw wkh frxqwu|1 Wklv lv rqh ri wkh odujhvw vfkrro frqvwuxfwlrq surjudpv rq
uhfrug1 L hydoxdwh wkh hhfw ri wklv surjudp rq hgxfdwlrq dqg zdjhv e| frpelqlqj glhuhqfhv
dfurvv uhjlrqv lq wkh qxpehu ri vfkrrov frqvwuxfwhg zlwk glhuhqfhv dfurvv frkruwv lqgxfhg
e| wkh wlplqj ri wkh surjudp1 Wkh hvwlpdwhv vxjjhvw wkdw wkh frqvwuxfwlrq ri sulpdu| vfkrrov
ohg wr dq lqfuhdvh lq hgxfdwlrq dqg hduqlqjv1 Fkloguhq djhg 5 wr 9 lq 4<:7 uhfhlyhg 3145 wr
314< pruh |hduv ri hgxfdwlrq iru hdfk vfkrro frqvwuxfwhg shu 4/333 fkloguhq lq wkhlu uhjlrq
ri eluwk1 Xvlqj wkh yduldwlrqv lq vfkrrolqj jhqhudwhg e| wklv srolf| dv lqvwuxphqwdo yduldeohv
iru wkh lpsdfw ri hgxfdwlrq rq zdjhv jhqhudwhv hvwlpdwhv ri hfrqrplf uhwxuqv wr hgxfdwlrq
udqjlqj iurp 91; shufhqw wr 4319 shufhqw1 +MHO L5/ M64/ R48/ R55,
Wkh txhvwlrq ri zkhwkhu lqyhvwphqw lq lqiudvwuxfwxuh lqfuhdvhv kxpdq fdslwdo dqg uhgxfhv
sryhuw| kdv orqj ehhq d frqfhuq wr ghyhorsphqw hfrqrplvwv dqg srolf|pdnhuv1 Iru h{dpsoh/
dydlodelolw| ri vfkrrolqj lqiudvwuxfwxuh kdv ehhq vkrzq wr eh srvlwlyho| fruuhodwhg zlwk frpsohwhg
vfkrrolqj ru hquroophqw e| Qdqf| Elugvdoo +4<;8, lq xuedq Eud}lo/ Ghqqlv GhWud| +4<;;, dqg Ohh

My Code
from string import letters

# build the decipher dictionary: the text was Caesar-shifted by 3,
# so 'd' maps back to 'a', 'e' to 'b', and so on
l = letters[:26]
decipher = "".join([l[(i+3) % 26] for i in range(len(l))])
decipher = dict(zip(decipher, l))

# open and read the encrypted text
filename = 'estherDuflo.txt'
f = open(filename, 'r')
lines = f.readlines()
lines = [line[:-1] for line in lines]

# only the first 1475 lines are encrypted
newlines = lines[:1475]

# apply the decryption to those lines
decipheredLines = []
for line in newlines:
    x = line.lower()
    s = []
    for letter in x:
        if letter in letters:
            s.append(decipher[letter])
        else:
            s.append(letter)
    s.append('\n')
    decipheredLines.append(''.join(s))

# write the deciphered text to a new text file
decipheredFile = 'estherDufloDeciphered.txt'
df = open(decipheredFile, 'w')
for line in decipheredLines:
    df.write("%s" % line)

# close both text files
f.close()
df.close()

Sample from the Decrypted File
5r ii*?} @?_ t@jie @e!f| i?qf^ f?rfq ir 5r ii*
i?q|e r|i? ? t?_i?fq@d ,_f?rf rei4 @? k? q @* i*r)
, qfe4f?|
,q| fe # i
t
abstract
between 4<:6 and 4<:;/ the indonesian government constructed over 94/333 primar|
schools throughout the countr|1 this is one of the largest school construction programs on
record1 i evaluate the eect of this program on education and wages b| combining dierences
across regions in the number of schools constructed with dierences across cohorts induced
b| the timing of the program1 the estimates suggest that the construction of primar| schools
led to an increase in education and earnings1 children aged 5 to 9 in 4<:7 received 3145 to
314< more |ears of education for each school constructed per 4/333 children in their region
of birth1 using the variations in schooling generated b| this polic| as instrumental variables
for the impact of education on wages generates estimates of economic returns to education
ranging from 91; percent to 4319 percent1 +jel i5/ j64/ o48/ o55,
the question of whether investment in infrastructure increases human capital and reduces
povert| has long been a concern to development economists and polic|makers1 for e{ample/
availabilit| of schooling infrastructure has been shown to be positivel| correlated with completed
schooling or enrollment b| nanc| birdsall +4<;8, in urban bra}il/ dennis detra| +4<;;, and lee

Magic 5-gon Ring — Project Euler (Problem 68)

Yet another exciting math problem that requires an algorithmic approach to arrive at a quick solution! There is a pen-paper approach to it too, but this post assumes we’re more interested in discussing the programming angle.

First, the problem:

Working clockwise, and starting from the group of three with the numerically lowest external node (4,3,2 in this example), each solution can be described uniquely. For example, the above solution can be described by the set: 4,3,2; 6,2,1; 5,1,3.

It is possible to complete the ring with four different totals: 9, 10, 11, and 12. There are eight solutions in total.

Total Solution Set:
9 4,2,3; 5,3,1; 6,1,2
9 4,3,2; 6,2,1; 5,1,3
10 2,3,5; 4,5,1; 6,1,3
10 2,5,3; 6,3,1; 4,1,5
11 1,4,6; 3,6,2; 5,2,4
11 1,6,4; 5,4,2; 3,2,6
12 1,5,6; 2,6,4; 3,4,5
12 1,6,5; 3,5,4; 2,4,6
By concatenating each group it is possible to form 9-digit strings; the maximum string for a 3-gon ring is 432621513.

Problem

Using the numbers 1 to 10, and depending on arrangements, it is possible to form 16- and 17-digit strings. What is the maximum 16-digit string for a "magic" 5-gon ring?

Algorithm

In attempting this problem, I chose to label the 5 inner nodes i, j, k, l, and m, with α, β, γ, δ, and θ as the corresponding outer nodes.

Let x be the sum total of each triplet line, i.e.,

x = α + i + j = β + j + k = γ + k + l = δ + l + m = θ + m + i

magic5gon

First Observation:
For the string to be 16 digits long, 10 has to be in the outer ring, as each number in the inner ring is included in the string twice. Next, we fill the inner ring in an iterative manner.

Second Observation:
There are 9 numbers to choose from for the inner ring — 1, 2, 3, 4, 5, 6, 7, 8 and 9.
5 of them have to be chosen, which can be done in 9C5 = 126 ways.
By circular permutation, n distinct numbers can be arranged around a circle in (n-1)! ways, where (n-1)! = (n-1).(n-2).(n-3)…3.2.1. So 5 distinct numbers can be arranged in 4! = 24 ways around a circle, or pentagonal ring, to be more precise.
So in all, this problem can be solved in 126 × 24 = 3024 iterations.

Third Observation:
For every possible permutation of an inner-ring arrangement, there can be one or more values of x (the triplet line-sum) that serve as possible contenders for a “magic” string whose triplets all add up to x. To ensure this, we only need the values of α through θ in the outer ring to be distinct, different from those in the inner ring, and with the greatest of them equal to 10.
Depending on the relative positioning of the numbers in the inner ring, one can narrow the range of x-values to be checked for each permutation. To zero in on such a range, let's look at an example. Shown in the figure below is a randomly chosen permutation of numbers in the inner ring – 7, 2, 3, 4 and 5, in that order.

magic5gonInstance

So 10, 9, 8, 6 and 1 must fill the outer ring. It's easy to notice that 5, 7 is the greatest adjacent pair, so whatever x is, it has to be at least 5 + 7 + 1 = 13 (1 being the smallest number in the outer ring). Likewise, 2, 3 is the smallest adjacent pair, so whatever x is, it can't be any more than 2 + 3 + 10 = 15 (10 being the largest number in the outer ring). This leaves us with a narrow range of x-values to check: 13, 14 and 15.

Next, we arrange the 5 triplets in clockwise direction, starting with the triplet whose outer-ring node is smallest, to form a candidate string. Doing this for each of the 3024 permutations shortlists a set of candidates, from which the maximum is chosen.

That’s all there is to the problem!

Here’s the Python Code. It executes in about a tenth of a second!

from itertools import permutations
from itertools import combinations

# array of candidate solutions, empty at the beginning
record = []

# choose 5 numbers for the inner cells from 1 to 9; there are 9C5 combinations
# the problem asks for a 16-digit number, so 10 must not be in the inner cells
cells = range(1, 10)
inner_cells = [map(int, comb) for comb in combinations(cells, 5)]

# helpers to calculate the min and max adjacent-pair sums in a ring,
# including the wraparound pair (first element, last element)
def minCouple(array):
    answer = array[0] + array[-1]
    for i in xrange(len(array)-1):
        coupleSum = array[i] + array[i+1]
        if coupleSum < answer:
            answer = coupleSum
    return answer

def maxCouple(array):
    answer = array[0] + array[-1]
    for i in xrange(len(array)-1):
        coupleSum = array[i] + array[i+1]
        if coupleSum > answer:
            answer = coupleSum
    return answer

# Algorithm
for array in inner_cells:
    # fix the first chosen number as a pivot and permute the remaining four,
    # giving the 4! = 24 circular arrangements of each combination
    pivot = array[0]
    perm_array = array[1:]
    perms = [map(int, perm) for perm in permutations(perm_array, 4)]
    for perm in perms:
        checkArray = perm
        checkArray.insert(0, pivot)
        outerRing = [el for el in range(1, 11) if el not in checkArray]
        # bound the feasible line-sums x for this inner-ring arrangement
        xMax = minCouple(checkArray) + max(outerRing)
        xMin = maxCouple(checkArray) + min(outerRing)
        if xMax >= xMin:
            for x in xrange(xMin, xMax+1):
                i = checkArray[0]
                j = checkArray[1]
                k = checkArray[2]
                l = checkArray[3]
                m = checkArray[4]
                # solve for the outer nodes implied by this x
                alpha = x-i-j
                beta = x-j-k
                gamma = x-k-l
                delta = x-l-m
                theta = x-m-i
                outerCalculated = [alpha, beta, gamma, delta, theta]
                if sorted(outerCalculated) == sorted(outerRing):
                    a = [alpha, i, j]
                    b = [beta, j, k]
                    c = [gamma, k, l]
                    d = [delta, l, m]
                    e = [theta, m, i]
                    # start from the triplet with the smallest outer node
                    min_val = min(alpha, beta, gamma, delta, theta)
                    if alpha == min_val:
                        append = a+b+c+d+e
                    elif beta == min_val:
                        append = b+c+d+e+a
                    elif gamma == min_val:
                        append = c+d+e+a+b
                    elif delta == min_val:
                        append = d+e+a+b+c
                    elif theta == min_val:
                        append = e+a+b+c+d
                    digits = [str(n) for n in append]
                    record.append(int(''.join(digits)))

print max(record)

Ans: 6531031914842725