Scraping the Daily India Covid-19 Tracker for CSV Data

This is a very short post that will be very useful to help you quickly set up your COVID-19 datasets. I’m sharing code at the end of this post that scrapes through all CSV datasets made available by COVID19-India API.

Copy paste this standalone script into your R environment and get going!

There are 15+ CSV files on the India COVID-19 API website. raw_data3 is actually a live dataset and more can be expected in the days to come, which is why a script that automates the data sourcing comes in handy.  Snapshot of the file names and the data dimensions as of today, 100 days since the first case was recorded in the state of Kerala —

Image

My own analysis of the data and predictions are work-in-progress, going into a Github repo. Execute the code below and get started analyzing the data and fighting COVID-19!

rm(list = ls())
# Load relevant libraries -----------------------------------------------------
library(stringr)
library(data.table)
# =============================================================================
# COVID 19-India API: A volunteer-driven, crowdsourced database
# for COVID-19 stats & patient tracing in India
# =============================================================================
url <- "https://api.covid19india.org/csv/"
# List out all CSV files to source --------------------------------------------
html <- paste(readLines(url), collapse="\n")
pattern <- "https://api.covid19india.org/csv/latest/.*csv"
matched <- unlist(str_match_all(string = html, pattern = pattern))
# Downloading the Data --------------------------------------------------------
covid_datasets <- lapply(as.list(matched), fread)
# Naming the data objects appropriately ---------------------------------------
exclude_chars <- "https://api.covid19india.org/csv/latest/"
dataset_names <- substr(x = matched,
start = 1 + nchar(exclude_chars),
stop = nchar(matched)- nchar(".csv"))
# assigning variable names
for(i in seq_along(dataset_names)){
assign(dataset_names[i], covid_datasets[[i]])
}
Advertisement

Troubleshooting ‘Rattle’ (R library) Installation on Ubuntu

This post pertains to Ubuntu / Debian users only.

rattle is a free graphical interface for data mining with R. I wanted to visualize decision trees and had to install this library.
> install.packages('rattle')
got me the following error message:

configure: error: GTK version 2.8.0 required
ERROR: configuration failed for package ‘RGtk2’

rattle_installationNonZeroExit

This error occurs when attempting to install the RGtk2 package. The install is looking for the header files for GTK. Possibly they are not yet. Luckily the problem can be solved quite easily. Open Terminal (Ctrl + Alt + T) and type in the following commands:


sudo apt-get update
wajig install libgtk2.0-dev

Go back and try installing rattle now with the same command as earlier. It should work. It did for me! As you can see below, decision trees are visualized lot better with rattle than if you used just rpart.

rattle