Using R to download zipped data file, extract, and import .csv

I am trying to download and extract a .csv file from a webpage using R.
This question is a duplicate of Using R to download zipped data file, extract, and import data.
I cannot get the solution to work, but it may be due to the web address I am using.
I am trying to download the .csv files from http://data.worldbank.org/country/united-kingdom (under the Download Data drop-down).
Using @Dirk's solution from the link above, I tried:
temp <- tempfile()
download.file("http://api.worldbank.org/v2/en/country/gbr?downloadformat=csv",temp)
con <- unz(temp, "gbr_Country_en_csv_v2.csv")
dat <- read.table(con, header=T, skip=2)
unlink(temp)
I got the extended link by looking at the page source code, which I expect is causing the problems, although it works if I paste it into the address bar.
The file downloads with the correct size:
download.file("http://api.worldbank.org/v2/en/country/gbr?downloadformat=csv",temp)
# trying URL 'http://api.worldbank.org/v2/en/country/gbr?downloadformat=csv'
# Content type 'application/zip' length 332358 bytes (324 Kb)
# opened URL
# downloaded 324 Kb
# I also tried unzip, but I get this warning
con <- unzip(temp, "gbr_Country_en_csv_v2.csv")
# Warning message:
# In unzip(temp, "gbr_Country_en_csv_v2.csv") :
# requested file not found in the zip file
But these are the file names I see when I manually download and extract the zip.
I'd appreciate some help with where I am going wrong. Thanks.
I am using Windows 8, R version 3.1.0

In order to get your data to download and uncompress, you need to set mode="wb":
download.file("...",temp, mode="wb")
unzip(temp, "gbr_Country_en_csv_v2.csv")
dd <- read.table("gbr_Country_en_csv_v2.csv", sep=",",skip=2, header=T)
It looks like the default is "w", which assumes a text file. If it were a plain csv file this would be fine. But since it's compressed, it's a binary file, hence the "wb": on Windows, text mode translates line endings and corrupts binary downloads. Without the "wb" part, you can't open the zip at all.
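Putting the pieces together, a minimal end-to-end sketch (the internal file name is the one from the question; unzip(temp, list = TRUE) shows the real names if the World Bank has since changed them):
temp <- tempfile(fileext = ".zip")
download.file("http://api.worldbank.org/v2/en/country/gbr?downloadformat=csv",
              temp, mode = "wb")  # "wb" keeps the zip intact on Windows
unzip(temp, list = TRUE)          # inspect the archive's actual contents first
dat <- read.table(unz(temp, "gbr_Country_en_csv_v2.csv"),
                  header = TRUE, skip = 2, sep = ",")
unlink(temp)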

Almost everything is OK. In this case you only need to specify that it's a comma-separated file, e.g. using sep="," in read.table:
temp <- tempfile()
download.file("http://api.worldbank.org/v2/en/country/gbr?downloadformat=csv",
temp)
con <- unz(temp, "gbr_Country_en_csv_v2.csv")
dat <- read.table(con, header=T, skip=2, sep=",")
unlink(temp)
With this little change I can import your csv smoothly.
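As a side note, read.csv is essentially read.table with header = TRUE and sep = "," preset, so the last two lines can be shortened to:
con <- unz(temp, "gbr_Country_en_csv_v2.csv")
dat <- read.csv(con, skip = 2)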
HTH, Luca

The World Bank Development Indicators can be obtained using the WDI package. For example,
library(WDI)
inds <- WDIsearch(field = "indicator")[, 1]
GB <- WDI("GB", indicator = inds)
See the WDIsearch and WDI functions and the reference manual for more info.
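For a single series rather than every indicator, a small sketch (NY.GDP.MKTP.CD is the World Bank code for GDP in current US$; the year range is illustrative):
library(WDI)
gdp <- WDI(country = "GB", indicator = "NY.GDP.MKTP.CD", start = 1990, end = 2014)
head(gdp)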

Related

Problem with XLS files with R's package readxl

I need to read an XLS file in R, but I'm having a problem caused by the way my file is generated and by the R readxl package. I do not have this issue in Python, so my hope is that it's possible to solve this problem within R.
An application we use at my company exports reports in XLS format (not XLSX). This report is generated daily. What I need is to sum the total value of the rows in each file, in order to create a new report containing each day followed by its total value.
When I try to read these files in R using the readxl package, the program returns this error:
Erro: Can't subset columns that don't exist.
x Location 5 doesn't exist.
i There are only 0 columns.
Run rlang::last_error() to see where the error occurred.
Now, the weird thing is that when I open the XLS file in Excel before running my script, R is able to read it properly.
I guessed this was an error caused by something like the file only being completed when I open it... but the equivalent Python script gives me the correct result without that step.
I am now assuming this is a bug in the readxl package. Is there another package I could use to read XLS (not XLSX) files? One that does not depend on Java being installed on my computer, I mean.
My readxl script:
if (!require("readxl")) {install.packages("readxl"); library("readxl")}
"%,%" <- function(x, y) paste0(x, "\\", y)
year <- "2021"
month <- "Aug"
column <- 5 # VL_COVAR
path <- "F:\\variancia" %,% year %,% month
tiposDF <- c("date", "numeric", "list", "numeric", "numeric", "numeric", "list")
file.names <- dir(path, pattern = ".xls")
vari <- c()
for (i in 1:length(file.names)) {
  file <- paste(path, sep = "\\", file.names[i])
  print(paste("Reading ", file))
  dados <- read_excel(file, col_types = tiposDF)
  somaVar <- sum(dados[column])
  vari <- append(vari, c(somaVar))
}
vari
file <- paste(path, sep = "\\", 'Covariância.xls_02082021.xls')
print(paste("Reading ", file))
dados <- read_excel(file, col_types = tiposDF)
somaVar <- sum(dados[column])
vari <- append(vari, c(somaVar))
x <- import(file) # import() here presumably comes from the rio package
View(x)
Thanks everyone!
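One check worth adding here (an assumption about the cause, not something confirmed in this thread): reports exported by applications with a .xls extension are sometimes actually HTML or plain text, which readxl cannot parse until Excel re-saves them as genuine XLS. Peeking at the first bytes of a problem file shows which case you have:
f <- paste(path, sep = "\\", file.names[1]) # first report, reusing objects from the script above
readBin(f, "raw", n = 8)  # a genuine XLS file begins d0 cf 11 e0 a1 b1 1a e1
readLines(f, n = 2)       # if this prints HTML or plain text, the extension is misleading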

How to open a .json.gz file and create a dataframe in R?

I have downloaded an abcd.json.gz file. The file size is 10 GB. I want to parse it and create a data frame from it in an efficient way. How can I do this in R?
This is how I do it without unpacking the file first:
con <- gzfile("file.json.gz", "rt")
lines <- readLines(con)
close(con)
df <- jsonlite::stream_in(textConnection(lines), verbose = FALSE)
This uses line-by-line processing of JSON, which is the only JSON format I've encountered in the wild so far.
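For a 10 GB file, materialising every line with readLines may exhaust memory. stream_in also accepts a gzipped connection directly, so the same jsonlite call can stream without the intermediate character vector (a sketch, assuming the same line-delimited layout):
df <- jsonlite::stream_in(gzfile("file.json.gz"), verbose = FALSE)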

How to get a vector of the file names contained in a tempfile in R?

I am trying to automatically download a bunch of zip files using R. These files contain a wide variety of files, but I only need to load one of them as a data.frame to post-process it. It has a unique name, so I could catch it with str_detect(). However, having downloaded into a tempfile(), I cannot get a list of the files within it using list.files().
This is what I've tried so far:
temp <- tempfile()
download.file("https://url/file.zip", destfile = temp)
files <- list.files(temp) # this is where I only get "character(0)"
# After, I'd like to use something along the lines of:
data <- read.table(unz(temp, str_detect(files, "^file123.txt")), header = TRUE, sep = ";")
unlink(temp)
I know that the read.table() command probably won't work, but I think I'll be able to figure that out once I get a vector with the list of the files within temp.
I am on a Windows 7 machine and I am using R 3.6.0.
Following what was said before, this structure should allow you to check that the download worked and to list the archive's contents. Note that list.files() lists a directory on disk; to see inside the zip itself you need unzip(..., list = TRUE):
temp <- tempfile(fileext = ".zip")
download.file("https://url/file.zip", destfile = temp, mode = "wb")
files <- unzip(temp, list = TRUE)$Name
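From there, reading the one file you need could look like this (file123.txt and the semicolon separator are taken from the question; str_subset is stringr's "keep the matching strings" counterpart of str_detect):
target <- stringr::str_subset(files, "^file123\\.txt")
data <- read.table(unz(temp, target), header = TRUE, sep = ";")
unlink(temp)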

Error opening .nc files in R

I am new to R and NetCDF files. I am trying to open data on sea surface temperatures that are in a .nc file from here. My code is as follows:
rm(list=ls())
#install.packages(c("ncdf4", "chron", "RColorBrewer", "lattice"))
library(chron)
library(RColorBrewer)
library(lattice)
library(ncdf4)
# set path and filename
ncpath <- "C:/Users/Mihir Sharma/Dropbox/data/"
ncname <- "20140101023458-NCEI-L3C_GHRSST-SSTskin-AVHRR_Pathfinder-PFV5.3_NOAA19_G_2014001_night-v02.0-fv01.0"
ncfname <- paste(ncpath, ncname, ".nc", sep="")
dname <- "tmp" # note: tmp means temperature (not temporary)
# open a NetCDF file
ncin <- nc_open(ncfname)
But I am getting the following error:
Error in nc_open(ncfname) :
Error in nc_open trying to open file C:/Users/Mihir Sharma/Dropbox/1 EPIC/MPA/data/20140101023458-NCEI-L3C_GHRSST-SSTskin-AVHRR_Pathfinder-PFV5.3_NOAA19_G_2014001_night-v02.0-fv01.0.nc
I have followed the code from here and here. What am I doing wrong?
Many thanks,
Mihir
Path problems when using someone else's code:
# set path and filename
ncpath <- "C:/Users/Mihir Sharma/Dropbox/data/"
When you're following code from a blog or tutorial, a well-written one will describe paths in a platform-independent way, but often they won't.
The platform-independent way to write a path is:
file.path("Users", "Mihir Sharma", "Dropbox", "data"). This will always use the correct file separator for your platform, a value stored in .Platform$file.sep.

Parse multiple XBRL files stored in a zip file

I have downloaded multiple zip files from a website. Each zip file contains multiple html and xml extension files (~100K in each).
It is possible to manually extract the files and then parse them. However, I would like to be able to do this within R (if possible).
Example file (sorry, it is a bit big), using code from a previous question to download one zip file:
library(XML)
pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
doc <- htmlParse(pth)
myfiles <- doc["//a[contains(text(),'Accounts_Monthly_Data')]", fun = xmlAttrs][[1]]
fileURLS <- file.path("http://download.companieshouse.gov.uk", myfiles) [[1]]
dir.create("temp", "hmrcCache")
download.file(fileURLS, destfile = file.path("temp", myfiles))
I can parse the files using the XBRL package if I manually extract them. This can be done as follows:
library(XBRL)
inst <- file.path("temp", "Prod224_0004_00000121_20130630.html")
out <- xbrlDoAll(inst, cache.dir="temp/hmrcCache", prefix.out=NULL, verbose=T)
I am struggling with how to extract these files from the zip folder and parse each one, say, in a loop using R, without manually extracting them.
I have made a start, but don't know how to progress from here. Thanks for any advice.
# Get names of files
lst <- unzip(file.path("temp", myfiles), list=TRUE)
dim(lst) # 118626
# unzip and extract first file
nms <- lst$Name[1] # Prod224_0004_00000121_20130630.html
lst2 <- unz(file.path("temp", myfiles), filename=nms)
I am using Windows 8.1
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Using the suggestion from Karsten in the comments, I unzipped the files to a temporary directory and then parsed each file. I used the snow package to speed things up.
# Parse one zip file to start
fls <- list.files("temp", pattern = "\\.zip$")[[1]]
# Unzip into a scratch directory
tmp <- tempdir()
lst <- unzip(file.path("temp", fls), exdir = tmp)
# Only parse first 10 records
inst <- lst[1:10]
# Start to parse - in parallel
library(snow) # provides makeCluster, clusterCall, parLapply
cl <- makeCluster(parallel::detectCores())
clusterCall(cl, function() library(XBRL))
# Start
st <- Sys.time()
out <- parLapply(cl, inst, function(i)
  xbrlDoAll(i, cache.dir = "temp/hmrcCache", prefix.out = NULL, verbose = TRUE))
stopCluster(cl)
Sys.time() - st
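To go beyond the first ten records, a hypothetical extension of the same pattern (it rebuilds the cluster and loops over every downloaded archive, accumulating results in res):
library(XBRL)
library(snow)
cl <- makeCluster(parallel::detectCores())
clusterCall(cl, function() library(XBRL))
res <- list()
for (f in list.files("temp", pattern = "\\.zip$", full.names = TRUE)) {
  tmp <- tempfile()
  dir.create(tmp)
  lst <- unzip(f, exdir = tmp) # extract this archive into its own scratch dir
  res[[f]] <- parLapply(cl, lst, function(i)
    xbrlDoAll(i, cache.dir = "temp/hmrcCache", prefix.out = NULL, verbose = TRUE))
  unlink(tmp, recursive = TRUE) # remove the extracted files once parsed
}
stopCluster(cl)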
