read_xlsx function fails to import online dataset with valid url - r

(NB- I am very much a beginner in R.)
This is the code I tried:
read_xlsx("valid/url")
For some reason I get the error message:
'path' does not exist:'valid/url'
I know the URL works, I have tested it many times. I am mystified, so any help would be much appreciated.

If I understand your issue correctly, I think you are inputting the URL into the read_xlsx command. Far as I am aware, this will not work if your excel file is online, you will need to download it locally first.
I suggest the following adjustment:
url <- "valid/url"
temp <- tempfile()
download.file(url, temp, mode="wb")
df1 <- read_excel(path = temp)
This will download the excel file into a temporary file, which you can then read into a dataframe, since it will be saved locally.

Related

Down load data from HTTPS in R

I have a problem with downloading data from HTTPS in R, I try using curl, but it doesn't work.
URL <- "https://github.com/Bitakhparsa/Capstone/blob/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
options('download.file.method'='curl')
download.file(URL, destfile = "./data.csv", method="auto")
I downloaded the CSV file with that code, but the format was changed when I checked the data. So it didn't download correctly.
Would you please someone help me?
I think you might actually have the URL wrong. I think you want:
https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv
Then you can download the file directly using library(RCurl) rather than creating a variable with the URL
library(RCurl)
download.file("https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv",destfile="./data.csv",method="libcurl")
You can also just load the file directly into R from the site using the following
URL <- "https://github.com/Bitakhparsa/Capstone/blob/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
out <- read.csv(textConnection(URL))
You can use the 'raw.githubusercontent.com' link, i.e. in the browser, when you go to "https://github.com/Bitakhparsa/Capstone/blob/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv" you can click on the link "View raw" (it's above "Sorry about that, but we can’t show files that are this big right now.") and this takes you to the actual data. You also have some minor typos.
This worked as expected for me:
url <- "https://raw.githubusercontent.com/Bitakhparsa/Capstone/0850c8f65f74c58e45f6cdb2fc6d966e4c160a78/Plant_1_Generation_Data.csv"
download.file(url, destfile = "./data.csv", method="auto")
df <- read.csv("~/Desktop/data.csv")

r what doesn't curl_download like about a filename

I want to download some files (NetCDF although I don't think that matters) from a website
and write them to a specified data directory on my hard drive. Some code that illustrates my problems follows
library(curl)
baseURL <- "http://gsweb1vh2.umd.edu/LUH2/LUH2_v2f/"
fileChoice <- "IMAGE_SSP1_RCP19/multiple-states_input4MIPs_landState_ScenarioMIP_UofMD-IMAGE-ssp119-2-1-f_gn_2015-2100.nc"
destDir <- paste0(getwd(), "/data-raw/")
url <- paste0(baseURL, fileChoice)
destfile <- paste0(destDir, "test.nc")
curl_download(url, destfile) # this one works
destfile <- paste0(destDir, fileChoice)
curl_download(url, destfile) #this one fails
The error message is
Error in curl_download(url, destfile) :
Failed to open file /Users/gcn/Documents/workspace/landuse/data-raw/IMAGE_SSP1_RCP19/multiple-states_input4MIPs_landState_ScenarioMIP_UofMD-IMAGE-ssp119-2-1-f_gn_2015-2100.nc.curltmp.
It turns out the curl_download internally adds .curltmp to destfile and then removes it. I can't figure out what is writing
It turns out that the problem is the fileChoice variable includes a new directory; IMAGE_SSP1_RCP19. Once I created the directory the process worked fine. I'm posting this because someone else might make the same mistake I did.

Reading in Excel (downloaded with automated script) produces error when not manually opened and saved first

I run an automated script to download 3 .xls files from 3 websites every hour. When I later try to read in the .xls files in R to further work with them, R produces the following error message:
"Error: IOException (Java): block[ 2 ] already removed - does your POIFS have circular or duplicate block references?"
When I manually open and save the .xls files this problem doesn't appear anymore and everything works normal, but since the total number of files is increasing with 72 every day this is not a nice work around.
The script I use to download and save the files:
library(httr)
setwd("WORKDIRECTION")
orig_wd <- getwd()
FOLDERS <- c("NAME1","NAME2","NAME3") #representing folder names
LINKS <- c("WEBSITE_1", #the urls from which I download
"WEBSITE_2",
"WEBSITE_3")
NO <- length(FOLDERS)
for(i in 1:NO){
today <- as.character(Sys.Date())
if (!file.exists(paste(FOLDERS[i],today,sep="/"))){
dir.create(paste(FOLDERS[i],today,sep="/"))
}
setwd(paste(orig_wd,FOLDERS[i],today,sep="/"))
dat<-GET(LINKS[i])
bin <- content(dat,"raw")
now <- as.character(format(Sys.time(),"%X"))
now <- gsub(":",".",now)
writeBin(bin,paste(now,".xls",sep=""))
setwd(orig_wd)
}
I then read in the files with the following script:
require(gdata)
require(XLConnect)
require(xlsReadWrite)
wb = loadWorkbook("FILEPATH")
df = readWorksheet(wb, "Favourite List" , header = FALSE)
Does anybody have experience with this type of error, and knows a solution or workaround?
The problem is partly resolved by using the readxl package available in the CRAN library. After installation files can be read in with:
library(readxl)
read_excel("PathToFile")
The only problem is, that the last column is omitted while reading in. If I find a solution for this I'll update the awnser.

R read.dta and unz not working

I read a lot of files into R from zipped sources. I try to use the R function unz to read from zipped files because unlike unzip it does not leave any unzipped files on my harddisk.
However, this does not seem to work for zipped *.dta (Stata) files:
library(foreign)
temp <- tempfile()
download.file("http://databank.worldbank.org/data/download/WDI_csv.zip", temp)
wdi_unz <- read.csv(unz(temp, "WDI_Data.csv"))
unlink(temp)
temp <- tempfile()
download.file("http://www.rug.nl/research/ggdc/data/pwt/v80/pwt80.zip",temp)
pwt_unzip <- read.dta(unzip(temp, "pwt80.dta"))
pwt_unz <- read.dta(unz(temp, "pwt80.dta"))
unlink(temp)
Sorry for using the rather large World Development Indicators database (its 40+ MB), but I did not find any better working example.
The code produces an error when reading pwt_unz, [edit: but not when reading pwt_unzip]. What is the problem there? Probably it has something to do with the return value of unz not being compatible with the input for read.dta?
I think you need read.dta
Have a look here :
http://stat.ethz.ch/R-manual/R-devel/library/foreign/html/read.dta.html

How to download an .xlsx file from a dropbox (https:) location

I'm trying to adopt the Reproducible Research paradigm but meet people who like looking at Excel rather than text data files half way, by using Dropbox to host Excel files which I can then access using the .xlsx package.
Rather like downloading and unpacking a zipped file I assumed something like the following would work:
# Prerequisites
require("xlsx")
require("ggplot2")
require("repmis")
require("devtools")
require("RCurl")
# Downloading data from Dropbox location
link <- paste0(
"https://www.dropbox.com/s/",
"{THE SHA-1 KEY}",
"{THE FILE NAME}"
)
url <- getURL(link)
temp <- tempfile()
download.file(url, temp)
However, I get Error in download.file(url, temp) : unsupported URL scheme
Is there an alternative to download.file that will accept this URL scheme?
Thanks,
Jon
You have the wrong URL - the one you are using just goes to the landing page. I think the actual download URL is different, I managed to get it sort of working using the below.
I actually don't think you need to use RCurl or the getURL() function, and I think you were leaving out some relatively important /'s in your previous formulation.
Try the following:
link <- paste("https://dl.dropboxusercontent.com/s",
"{THE SHA-1 KEY}",
"{THE FILE NAME}",
sep="/")
download.file(url=link,destfile="your.destination.xlsx")
closeAllConnections()
UPDATE:
I just realised there is a source_XlsxData function in the repmis package, which in theory should do the job perfectly.
Also the function below works some of the time but not others, and appears to get stuck at the GET line. So, a better solution would be very welcome.
I decided to try taking a step back and figure out how to download a raw file from a secure (https) url. I adapted (butchered?) the source_url function in devtools to produce the following:
download_file_url <- function (
url,
outfile,
..., sha1 = NULL)
{
require(RCurl)
require(devtools)
require(repmis)
require(httr)
require(digest)
stopifnot(is.character(url), length(url) == 1)
filetag <- file(outfile, "wb")
request <- GET(url)
stop_for_status(request)
writeBin(content(request, type = "raw"), filetag)
close(filetag)
}
This seems to work for producing local versions of binary files - Excel included. Nicer, neater, smarter improvements in this gratefully received.

Resources