I read a lot of files into R from zipped sources. I try to use the R function unz to read from zipped files because unlike unzip it does not leave any unzipped files on my harddisk.
However, this does not seem to work for zipped *.dta (Stata) files:
library(foreign)
temp <- tempfile()
download.file("http://databank.worldbank.org/data/download/WDI_csv.zip", temp)
wdi_unz <- read.csv(unz(temp, "WDI_Data.csv"))
unlink(temp)
temp <- tempfile()
download.file("http://www.rug.nl/research/ggdc/data/pwt/v80/pwt80.zip",temp)
pwt_unzip <- read.dta(unzip(temp, "pwt80.dta"))
pwt_unz <- read.dta(unz(temp, "pwt80.dta"))
unlink(temp)
Sorry for using the rather large World Development Indicators database (its 40+ MB), but I did not find any better working example.
The code produces an error when reading pwt_unz, [edit: but not when reading pwt_unzip]. What is the problem there? Probably it has something to do with the return value of unz not being compatible with the input for read.dta?
I think you need read.dta
Have a look here :
http://stat.ethz.ch/R-manual/R-devel/library/foreign/html/read.dta.html
Related
I am using one of the files in here: http://orca.science.oregonstate.edu/1080.by.2160.monthly.hdf.vgpm.m.chl.m.sst.php:
untar(tarfile = "http://orca.science.oregonstate.edu/data/1x2/monthly/vgpm.r2018.m.chl.m.sst/hdf/vgpm.m.2010.tar", exdir = "./foo")
I get error: ar.exe: Error opening archive: Failed to open 'http://orca.science.oregonstate.edu/data/1x2/monthly/vgpm.r2018.m.chl.m.sst/hdf/vgpm.m.2010.tar'
so I manually had to download the file and untar it ( that is why cant provide a reproducible example here). Inside there are files of .hdf format:
I also was not able to read them:
library(ncdf4)
ncin <- nc_open(".\\vgpm.m.2010\\vgpm.2010001.hdf")
ncin
lon <- ncvar_get(ncin,"fakeDim0")
head(lon)
lat <- ncvar_get(ncin,"fakeDim1")
head(lat)
fillvalue <- ncatt_get(ncin,"npp","_FillValue")
Can you please help to explain why i cant utar the file and why .hdf files have no fill value?
You should be able to untar the file once you have downloaded it. Download the file first to your working directory, then untar from your working directory: untar("vgpm.m.2002.tar", exdir = "mydir"). Your issue is likely with the connection. There can be many reasons for that which are specific to your computer's settings. You'll need to troubleshoot that separately.
Once you untar the directory, the contents inside are not .hdf files. They are compressed .hdf files (thus why their file names end in .gz). You'll need to decompress:
library(R.utils)
gunzip("mydir/vgpm.2002335.hdf.gz", remove = FALSE)
Once you actually have the .hdf file, you need to open it and then read it. You are correct to use ncdf4 because it accommodates multiple .hdf file formats. Some of the older formats would need different packages or software.
To open and read it, you need two different functions, nc_open() and ncvar_get():
library(ncdf4)
dat <- nc_open("mydir/vgpm.2002335.hdf", write = TRUE)
ncvar_get(dat)
Note that these functions will NOT work if you have not completed the pre-requisite set-up explained in detail in the documentation. For example:
Both the netCDF library and the HDF5 library must already be installed on your machine for this R interface to the library to work.
I also tried to rasterize it: it also works great:
library(raster)
x <- raster("~.\\vgpm.2010001.hdf")
extent(x) <- extent(-180, 180, -90, 90)
crs(x) <- "+proj=longlat +datum=WGS84"
NAvalue(x) <- -9999
#plot(x)
f1<- as.data.frame(x, xy=TRUE)
(NB- I am very much a beginner in R.)
This is the code I tried:
read_xlsx("valid/url")
For some reason I get the error message:
'path' does not exist:'valid/url'
I know the URL works, I have tested it many times. I am mystified, so any help would be much appreciated.
If I understand your issue correctly, I think you are inputting the URL into the read_xlsx command. Far as I am aware, this will not work if your excel file is online, you will need to download it locally first.
I suggest the following adjustment:
url <- "valid/url"
temp <- tempfile()
download.file(url, temp, mode="wb")
df1 <- read_excel(path = temp)
This will download the excel file into a temporary file, which you can then read into a dataframe, since it will be saved locally.
I have a list of files in a directory which I'm trying to convert to csv, had tried rio package and solutions as suggested here
The output is list of empty CSV files with no content. It could be because the first 8 rows of the xls files have an image and few emtpy lines with couple couple of cells filled with text.
Is there any way I could skip those first 8 lines in all of xls files before converting.
Tried exploring options from openxlsx or readxls packages, any suggestions or guidance will be helpful.
Please do not mark as duplicate since I have a different problem than the one that was already answered
Maybe the following will work. At least it does for my own mock-up of an excel file with a picture in the top
library("readxl") # To read xlsx
library("readr") # Fast csv write
indata <- read_excel("~/cowexcel.xlsx", skip=8)
write_csv(indata, path="cow.csv")
If you are running this for several files then combine it into a function. Note that the function below does no checking and might overwrite existing csv files
convert_excel_to_csv <- function(name) {
indata <- read_excel(name, skip=8)
write_csv(indata, path=paste0(tools::file_path_sans_ext(name), ".csv"))
}
Although I was not able to do it with rio to convert, I read it as xls and wrote it back as csv using below code. Testing worked fine, Hope it works without glitch in implementation.
files <- list.files(pattern = '*.xls')
y=NULL
for(i in files ) {
x <- read.xlsx(i, sheetIndex = 1, header=TRUE, startRow=9)
y= rbind(y,x)
}
dt <- Sys.Date()
fn<- paste("path/",dt,".csv",sep="")
write.csv(y,fn,row.names = FALSE)
I run an automated script to download 3 .xls files from 3 websites every hour. When I later try to read in the .xls files in R to further work with them, R produces the following error message:
"Error: IOException (Java): block[ 2 ] already removed - does your POIFS have circular or duplicate block references?"
When I manually open and save the .xls files this problem doesn't appear anymore and everything works normal, but since the total number of files is increasing with 72 every day this is not a nice work around.
The script I use to download and save the files:
library(httr)
setwd("WORKDIRECTION")
orig_wd <- getwd()
FOLDERS <- c("NAME1","NAME2","NAME3") #representing folder names
LINKS <- c("WEBSITE_1", #the urls from which I download
"WEBSITE_2",
"WEBSITE_3")
NO <- length(FOLDERS)
for(i in 1:NO){
today <- as.character(Sys.Date())
if (!file.exists(paste(FOLDERS[i],today,sep="/"))){
dir.create(paste(FOLDERS[i],today,sep="/"))
}
setwd(paste(orig_wd,FOLDERS[i],today,sep="/"))
dat<-GET(LINKS[i])
bin <- content(dat,"raw")
now <- as.character(format(Sys.time(),"%X"))
now <- gsub(":",".",now)
writeBin(bin,paste(now,".xls",sep=""))
setwd(orig_wd)
}
I then read in the files with the following script:
require(gdata)
require(XLConnect)
require(xlsReadWrite)
wb = loadWorkbook("FILEPATH")
df = readWorksheet(wb, "Favourite List" , header = FALSE)
Does anybody have experience with this type of error, and knows a solution or workaround?
The problem is partly resolved by using the readxl package available in the CRAN library. After installation files can be read in with:
library(readxl)
read_excel("PathToFile")
The only problem is, that the last column is omitted while reading in. If I find a solution for this I'll update the awnser.
#EZGraphs on Twitter writes:
"Lots of online csvs are zipped. Is there a way to download, unzip the archive, and load the data to a data.frame using R? #Rstats"
I was also trying to do this today, but ended up just downloading the zip file manually.
I tried something like:
fileName <- "http://www.newcl.org/data/zipfiles/a1.zip"
con1 <- unz(fileName, filename="a1.dat", open = "r")
but I feel as if I'm a long way off.
Any thoughts?
Zip archives are actually more a 'filesystem' with content metadata etc. See help(unzip) for details. So to do what you sketch out above you need to
Create a temp. file name (eg tempfile())
Use download.file() to fetch the file into the temp. file
Use unz() to extract the target file from temp. file
Remove the temp file via unlink()
which in code (thanks for basic example, but this is simpler) looks like
temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
data <- read.table(unz(temp, "a1.dat"))
unlink(temp)
Compressed (.z) or gzipped (.gz) or bzip2ed (.bz2) files are just the file and those you can read directly from a connection. So get the data provider to use that instead :)
Just for the record, I tried translating Dirk's answer into code :-P
temp <- tempfile()
download.file("http://www.newcl.org/data/zipfiles/a1.zip",temp)
con <- unz(temp, "a1.dat")
data <- matrix(scan(con),ncol=4,byrow=TRUE)
unlink(temp)
I used CRAN package "downloader" found at http://cran.r-project.org/web/packages/downloader/index.html . Much easier.
download(url, dest="dataset.zip", mode="wb")
unzip ("dataset.zip", exdir = "./")
For Mac (and I assume Linux)...
If the zip archive contains a single file, you can use the bash command funzip, in conjuction with fread from the data.table package:
library(data.table)
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | funzip")
In cases where the archive contains multiple files, you can use tar instead to extract a specific file to stdout:
dt <- fread("curl http://www.newcl.org/data/zipfiles/a1.zip | tar -xf- --to-stdout *a1.dat")
Here is an example that works for files which cannot be read in with the read.table function. This example reads a .xls file.
url <-"https://www1.toronto.ca/City_Of_Toronto/Information_Technology/Open_Data/Data_Sets/Assets/Files/fire_stns.zip"
temp <- tempfile()
temp2 <- tempfile()
download.file(url, temp)
unzip(zipfile = temp, exdir = temp2)
data <- read_xls(file.path(temp2, "fire station x_y.xls"))
unlink(c(temp, temp2))
To do this using data.table, I found that the following works. Unfortunately, the link does not work anymore, so I used a link for another data set.
library(data.table)
temp <- tempfile()
download.file("https://www.bls.gov/tus/special.requests/atusact_0315.zip", temp)
timeUse <- fread(unzip(temp, files = "atusact_0315.dat"))
rm(temp)
I know this is possible in a single line since you can pass bash scripts to fread, but I am not sure how to download a .zip file, extract, and pass a single file from that to fread.
Using library(archive) one can also read in a particular csv file within the archive, without having to UNZIP it first; read_csv(archive_read("http://www.newcl.org/data/zipfiles/a1.zip", file = 1), col_types = cols())
which I find more convenient & is faster.
It also supports all major archive formats & is quite a bit faster than the base R untar or unz - it supports tar, ZIP, 7-zip, RAR, CAB, gzip, bzip2, compress, lzma, xz & uuencoded files.
To unzip everything one can use archive_extract("http://www.newcl.org/data/zipfiles/a1.zip", dir=XXX)
This works on all platforms & given the superior performance for me would be the preferred option.
Try this code. It works for me:
unzip(zipfile="<directory and filename>",
exdir="<directory where the content will be extracted>")
Example:
unzip(zipfile="./data/Data.zip",exdir="./data")
rio() would be very suitable for this - it uses the file extension of a file name to determine what kind of file it is, so it will work with a large variety of file types. I've also used unzip() to list the file names within the zip file, so its not necessary to specify the file name(s) manually.
library(rio)
# create a temporary directory
td <- tempdir()
# create a temporary file
tf <- tempfile(tmpdir=td, fileext=".zip")
# download file from internet into temporary location
download.file("http://download.companieshouse.gov.uk/BasicCompanyData-part1.zip", tf)
# list zip archive
file_names <- unzip(tf, list=TRUE)
# extract files from zip file
unzip(tf, exdir=td, overwrite=TRUE)
# use when zip file has only one file
data <- import(file.path(td, file_names$Name[1]))
# use when zip file has multiple files
data_multiple <- lapply(file_names$Name, function(x) import(file.path(td, x)))
# delete the files and directories
unlink(td)
I found that the following worked for me. These steps come from BTD's YouTube video, Managing Zipfile's in R:
zip.url <- "url_address.zip"
dir <- getwd()
zip.file <- "file_name.zip"
zip.combine <- as.character(paste(dir, zip.file, sep = "/"))
download.file(zip.url, destfile = zip.combine)
unzip(zip.file)