Download zip file to R when download link ends in '/download'

My issue is similar to this post, but the suggested solution does not appear applicable.
I have a lot of zipped data stored on an online server (B2Drop) that provides a download link ending in "/download" instead of ".zip". I have been unable to get the method described here to work.
I have created a test download page, https://b2drop.eudat.eu/s/K9sPPjWz3jxtXEq, where the download link https://b2drop.eudat.eu/s/K9sPPjWz3jxtXEq/download can be obtained by right-clicking the download button. Here is my script:
temp <- tempfile()
download.file("https://b2drop.eudat.eu/s/K9sPPjWz3jxtXEq/download",temp, mode="wb")
data <- read.table(unz(temp, "Test_file1.csv"))
unlink(temp)
When I run it, I get the error:
download.file("https://b2drop.eudat.eu/s/K9sPPjWz3jxtXEq/download",temp, mode="wb")
trying URL 'https://b2drop.eudat.eu/s/K9sPPjWz3jxtXEq/download'
Content type 'application/zip' length 558 bytes
downloaded 558 bytes
data <- read.table(unz(temp, "Test_file1.csv"))
Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
cannot locate file 'Test_file1.csv' in zip file 'C:\Users\User_name\AppData\Local\Temp\RtmpMZ6gXi\file3e881b1f230e'
which typically indicates a problem with the working directory where R is looking for the file. In this case, that should be the temp directory.

Your internal path is wrong. You can use list=TRUE to list the files in the archive, analogous to the command-line unzip utility's -l argument.
unzip(temp, list=TRUE)
# Name Length Date
# 1 Test/Test_file1.csv 256 2021-09-27 10:13:00
# 2 Test/Test_file2.csv 286 2021-09-27 10:14:00
Better than read.table, though, is read.csv, since the file is comma-delimited.
data <- read.csv(unz(temp, "Test/Test_file1.csv"))
head(data, 3)
# ID Variable1 Variable2 Variable Variable3
# 1 1 f 54654 25 t1
# 2 2 t 421 64 t2
# 3 3 x 4521 85 t3
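If you need every file in the archive, a minimal sketch built on that listing (assuming every entry in the zip is a CSV):
# read every file listed in the archive into a named list of data.frames
files <- unzip(temp, list = TRUE)$Name
csvs <- lapply(files, function(f) read.csv(unz(temp, f)))
names(csvs) <- basename(files)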

Related

CSV file output not well aligned with "read.csv()"

I run the following R code in RKWard to read a CSV file:
# I) Go to the working directory
setwd("/home/***/Desktop/***")
# II) Verify the current working directory
print(getwd())
# III) Load the needed package
require("csv")
# IV) Read the desired file
read.csv(file="CSV_Example.csv", header=TRUE, sep=";")
The data in the CSV file is as follows (an example taken from this website):
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Nina,578,2013-05-21,IT
7,Simon,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance
But the result is as follows:
id.name.salary.start_date.dept
1 1,Rick,623.3,2012-01-01,IT
2 2,Dan,515.2,2013-09-23,Operations
3 3,Michelle,611,2014-11-15,IT
4 4,Ryan,729,2014-05-11,HR
5 5,Gary,843.25,2015-03-27,Finance
6 6,Nina,578,2013-05-21,IT
7 7,Simon,632.8,2013-07-30,Operations
8 8,Guru,722.5,2014-06-17,Finance
PROBLEM: The data are not aligned as they are supposed to be.
Can anyone help me, please?
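The sep argument is the culprit here: the file is comma-delimited, but the code reads it with sep=";", so each entire line lands in a single column and nothing aligns. A minimal fix, assuming the file is exactly as shown (note that read.csv ships with base R's utils package, so require("csv") is unnecessary):
# sep = "," is read.csv's default, so the override can simply be dropped
data <- read.csv(file = "CSV_Example.csv", header = TRUE)
head(data, 3)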

Error in reading files in R

I'm a newcomer to the R community. Coding my first programs, I've run into a silly problem! When trying to read an RDS file with the following code:
tweets <- readRDS("RDataMining-Tweets-20160212.rds")
the following error arises:
Error in gzfile(file, "rb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "rb") :
cannot open compressed file 'RDataMining-Tweets-20160212.rds', probable reason 'No such file or directory'
What's the problem here?
Since we don't have access to your file, it'll be difficult to know for sure, so let me give you some examples of what other types of files might give you.
First, some files:
ctypes <- list(FALSE, 'gzip', 'bzip2', 'xz')
saverds_names <- sprintf('saveRDS_%s.rds', ctypes)
save_names <- sprintf('save_%s.rda', ctypes)
ign <- mapply(function(fn, ct) saveRDS(mtcars, file = fn, compress = ct),
              saverds_names, ctypes)
ign <- mapply(function(fn, ct) save(mtcars, file = fn, compress = ct),
              save_names, ctypes)
A common (unix-y) utility is file, which uses file signatures to determine probable file type. (If you are on Windows, it is usually installed with Rtools, so look for it there. If Sys.which("file") is empty, look around for where you have Rtools installed, for something like c:\Rtools\bin\file.exe.)
Sys.which('file')
# file
# "c:\\Rtools\\bin\\file.exe"
With this, let's see what file thinks these files are likely to be:
str(lapply(saverds_names, function(fn) system2("file", fn, stdout=TRUE)))
# List of 4
# $ : chr "saveRDS_FALSE.rds: data"
# $ : chr "saveRDS_gzip.rds: gzip compressed data, from HPFS filesystem (OS/2, NT)"
# $ : chr "saveRDS_bzip2.rds: bzip2 compressed data, block size = 900k"
# $ : chr "saveRDS_xz.rds: XZ compressed data"
str(lapply(save_names, function(fn) system2("file", fn, stdout=TRUE)))
# List of 4
# $ : chr "save_FALSE.rda: data"
# $ : chr "save_gzip.rda: gzip compressed data, from HPFS filesystem (OS/2, NT)"
# $ : chr "save_bzip2.rda: bzip2 compressed data, block size = 900k"
# $ : chr "save_xz.rda: XZ compressed data"
Helps a little. If yours does not return one of these four strings, then you are likely looking at a corrupted file (or a mis-named one, i.e., not really in the .rds format we are expecting).
If it does return one of them, though, know that readRDS (for the first four) and load (for the last four) will automatically determine the compress= argument to use; if reading still fails despite a recognized signature, the file is most likely corrupt (or some other form of compressed data; again, likely mis-named).
In contrast, some other file types return these:
system2("file", "blank.accdb")
# blank.accdb: raw G3 data, byte-padded
system2("file", "Book1.xlsx")
# Book1.xlsx: Zip archive data, at least v2.0 to extract
system2("file", "Book1.xls")
# Book1.xls: OLE 2 Compound Document
system2("file", "j.fra.R") # code for this answer
# j.fra.R: ASCII text, with CRLF line terminators
(The with CRLF is a Windows-y thing. *sigh*) The last one will also be the case for CSV and similar text-based tabular files, etc.
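If the file utility isn't available, a rough in-R equivalent is to peek at the leading bytes yourself (a sketch; the magic numbers cited in the comment are the standard signatures for these formats):
# inspect the first few raw bytes of the suspect file
readBin("RDataMining-Tweets-20160212.rds", what = raw(), n = 6)
# gzip starts 1f 8b, bzip2 starts 42 5a 68 ("BZh"), xz starts fd 37 7a 58 5a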
@divibisan's suggestion that the file could be corrupt is the most likely culprit, in my mind, but it might give different output:
file.size(saverds_names[1])
# [1] 3798
head(readRDS(rawConnection(readBin(saverds_names[1], what = raw(), n = file.size(saverds_names[1])))), n = 2)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
but incomplete data from truncated files looks different: I truncated the files (externally, with dd), and the error I received was "Error in readRDS: error reading from connection".
Looking at the R source, that error string is only present in R_gzread, suggesting that R thinks the file is compressed with gzip (which is the default, perhaps because it could not positively identify any other obvious compression method).
This isn't much of an answer, but it might give you some appreciation for what could be wrong. The bottom line, unfortunately, is that you are highly unlikely to be able to recover any data from a corrupted file.
Try opening the file this way, avoiding any possible problem related to the file path:
tweets <- readRDS(file.choose())
An interactive window will open and you will be able to select your file.
There are two rds files at the website that supports the book you are following:
http://www.rdatamining.com/data/
One of them is named RDataMining-Tweets-20160203.rds and the other RDataMining-Tweets-20160212.rds. I suspect you put them in a "downloads" folder using your browser, and that happens to be a different folder than the one you will see if you execute:
getwd()
You should try re-downloading from that website and checking the file locations.
You can get a listing of files in your current working directory with:
list.files()
Make sure that the file name appears in that output.
If that proves difficult, then this also should succeed:
tweets2 <- readRDS( url("http://www.rdatamining.com/data/RDataMining-Tweets-20160212.rds?attredirects=0&d=1") )
str(tweets2) # should be a complex and long output
The url function creates a "connection" that can be used to download RDS files.
First check your working directory with getwd() and see whether the file you want to read is in that directory. If not, use setwd() with the path to the directory where the file actually is.
Also, pass the file name (keeping its .rds extension) through the file argument: readRDS(file = "RDataMining-Tweets-20160212.rds").
Chuckle - I recognise that data file name...
You appear to be using the reference data from Yanchang Zhao's presentation on "Text Mining with R"; the data set can be downloaded from the author's website (http://www.rdatamining.com/data) via the RDataMining-Tweets-20160203.rds link.
Looking at the error, as Devyani Balyan states, the file you are trying to load is likely not in your working directory. The code examples generally put the data file in a <project>\data structure; Yanchang follows this convention in his examples.
My recommendation is to get a copy of a good R book to help you get started; my personal pick would be "The R Book" by Michael J. Crawley.
Here is a working code example, I hope it helps you progress:
Code Example
# used to load multiple R packages at once
# and or install them.
ipak <- function(pkg){
  new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
  if (length(new.pkg))
    install.packages(new.pkg, dependencies = TRUE)
  sapply(pkg, require, character.only = TRUE)
}
# Load twitterR Package
packages <- c("twitteR")
ipak(packages)
# Load Example dataset from the website
url <- "http://www.rdatamining.com/data/RDataMining-Tweets-20160212.rds"
destfile = "RDataMining-Tweets-20160212.rds"
# Download into the current directory for simplicity
# Normally I would recommend saving to <project>/data
# Check if file exists if not download it...
if (!file.exists(destfile)) {
download.file(url, destfile)
}
## load tweets into R
tweets <- readRDS(destfile)
head(tweets, n = 3)
Console output
> ipak <- function(pkg){
+ new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
+ if (length(new.pkg))
+ install.packages(new.pkg, dependencies = TRUE)
+ sapply(pkg, require, character.only = TRUE)
+ }
> # Load twitterR Package
> packages <- c("twitteR")
> ipak(packages)
twitteR
TRUE
> url <- "http://www.rdatamining.com/data/RDataMining-Tweets-20160212.rds"
> destfile = "RDataMining-Tweets-20160212.rds"
> # Check if file exists if not download it...
> if (!file.exists(destfile)) {
+ download.file(url, destfile)
+ }
> ## load tweets into R
> tweets <- readRDS(destfile)
> head(tweets, n = 3)
[[1]]
[1] "RDataMining: A Twitter dataset for text mining: #RDataMining Tweets extracted on 3 February 2016. Download it at **********"
[[2]]
[1] "RDataMining: Vacancy of Data Scientist – Healthcare Analytics, Adelaide, Australia\n***********"
[[3]]
[1] "RDataMining: Thanks to all for your ongoing support to ******** 3. Merry Christmas and Happy New Year!"
It seems your working directory is not properly set.
See which directory you downloaded the file into and copy the path.
Use the command setwd('Your/path/here') and then it should work. Just remember to change the orientation of your slashes (from '\' to '/').
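For example (the path below is hypothetical):
setwd("C:/Users/you/Downloads")   # forward slashes, not C:\Users\you\Downloads
tweets <- readRDS("RDataMining-Tweets-20160212.rds")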

Retrieving PubMed IDs and title: HTTP error

I have written R code to fetch PubMed IDs and titles using the journal name, date, volume, issue and page number.
The file contains rows like
AAPS PharmSci 2000 2 1 E2
AAPS PharmSci 2004 6 1 1-9
And the output I want is like:
AAPS PharmSci 2000 2 1 E2, 11741218 , Molecular modeling of G-protein coupled receptor kinase 2: docking and biochemical evaluation of inhibitors.
and similarly for all the rows in the file.
The R code I have written for this (the EUtils* functions come from the RISmed package) is:
search_topic <- "search term"
search_query <- EUtilsSummary(search_topic)
#summary(search_query)
# see the ids of our returned query
ID <- QueryId(search_query)
# get actual data from PubMed
records<- EUtilsGet(search_query)
# store it
pubmed_data <- data.frame(ID,'Title'=ArticleTitle(records))
write.csv(pubmed_data, file = paste("./",search_topic,".csv",sep=""))
Which gives an error like:
In addition: Warning message:
In file(con, "r") : cannot open: HTTP status was '502 Server Hangup'
Please let me know where I am going wrong.
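An HTTP 502 is a transient server-side failure at the PubMed end rather than a bug in the query itself, so the usual workaround is to throttle requests and retry after a pause. A minimal sketch of that idea, assuming the RISmed package whose functions are used above (the helper name is made up for illustration):
library(RISmed)
# retry a PubMed query a few times before giving up
fetch_with_retry <- function(topic, tries = 3, wait = 5) {
  for (i in seq_len(tries)) {
    res <- tryCatch(EUtilsSummary(topic), error = function(e) NULL)
    if (!is.null(res)) return(res)
    Sys.sleep(wait)  # back off before the next attempt
  }
  stop("PubMed still unavailable after ", tries, " attempts")
}
search_query <- fetch_with_retry("search term")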

read.xls and url on Windows in R

I have seen many posts on here about using read.xls with a url and they all worked on my Mac, but now when I am trying to use the code on my Windows computer, it is not working. I used the below code on my Mac:
tmp <- tempfile()
download.file("https://www.spdrs.com/site-content/xls/SPY_All_Holdings.xls?fund=SPY&docname=All+Holdings&onyx_code1=1286&onyx_code2=1700", destfile = tmp, method = "curl")
SPY <- read.xls(tmp, skip=3)
unlink(tmp)
Using "curl" no longer woks ("had status 127" is the warning message) and when I try "internal" or "wininet", it says " formal argument "method" matched by multiple actual arguments". When I try read.xls, it says the file is "missing" and "invalid". I have downloaded Perl, Java, gdata, Rcurl and the "downloader" package (because I heard that works better with https) and could use that instead....Is there something else I would have to do on a Windows computer to make this code work?
Thanks!
> library(RCurl)
> library(XLConnect)
> URL <- "https://www.spdrs.com/site-content/xls/SPY_All_Holdings.xls?fund=SPY&docname=All+Holdings&onyx_code1=1286&onyx_code2=1700"
> f = CFILE("SPY_All_Holdings.xls", mode="wb")
> curlPerform(url = URL, writedata = f@ref, ssl.verifypeer = FALSE)
# OK
# 0
> close(f)
# An object of class "CFILE"
# Slot "ref":
# <pointer: (nil)>
> out <- readWorksheetFromFile(file = "SPY_All_Holdings.xls",sheet="SPY_All_Holdings")
> head(out)
# Fund.Name. SPDR..S.P.500..ETF Col3 Col4 Col5
# 1 Ticker Symbol: SPY <NA> <NA> <NA>
# 2 Holdings: As of 06/06/2016 <NA> <NA> <NA>
# 3 Name Identifier Weight Sector Shares Held
# 4 Apple Inc. AAPL 2.945380 Information Technology 54545070.000
# 5 Microsoft Corporation MSFT 2.220684 Information Technology 77807630.000
# 6 Exxon Mobil Corporation XOM 1.998224 Energy 40852760.000
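Since the first rows of the sheet hold fund metadata rather than column headers, one refinement (a sketch; readWorksheetFromFile passes a startRow argument through to readWorksheet) is to start reading at the header row:
out <- readWorksheetFromFile(file = "SPY_All_Holdings.xls",
                             sheet = "SPY_All_Holdings",
                             startRow = 4)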

Download a file from HTTPS using download.file()

I would like to read online data to R using download.file() as shown below.
URL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
download.file(URL, destfile = "./data/data.csv", method="curl")
Someone suggested to me that I add the line setInternet2(TRUE), but it still doesn't work.
The error I get is:
Warning messages:
1: running command 'curl "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv" -o "./data/data.csv"' had status 127
2: In download.file(URL, destfile = "./data/data.csv", method = "curl", :
download had nonzero exit status
Appreciate your help.
It might be easiest to try the RCurl package. Install the package and try the following:
# install.packages("RCurl")
library(RCurl)
URL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
x <- getURL(URL)
## Or
## x <- getURL(URL, ssl.verifypeer = FALSE)
out <- read.csv(textConnection(x))
head(out[1:6])
# RT SERIALNO DIVISION PUMA REGION ST
# 1 H 186 8 700 4 16
# 2 H 306 8 700 4 16
# 3 H 395 8 100 4 16
# 4 H 506 8 700 4 16
# 5 H 835 8 800 4 16
# 6 H 989 8 700 4 16
dim(out)
# [1] 6496 188
download.file("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv",destfile="reviews.csv",method="libcurl")
Here's an update as of Nov 2014. I find that setting method='curl' did the trick for me (while method='auto' does not).
For example:
# does not work
download.file(url='https://s3.amazonaws.com/tripdata/201307-citibike-tripdata.zip',
              destfile='localfile.zip')
# does not work; this appears to be the default anyway
download.file(url='https://s3.amazonaws.com/tripdata/201307-citibike-tripdata.zip',
              destfile='localfile.zip', method='auto')
# works!
download.file(url='https://s3.amazonaws.com/tripdata/201307-citibike-tripdata.zip',
              destfile='localfile.zip', method='curl')
I've succeeded with the following code:
url = "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
x = read.csv(file=url)
Note that I've changed the protocol from https to http, since the former didn't seem to be supported in R at the time. (Since R 3.2.0, download.file and url support https directly via the "libcurl" and "wininet" methods.)
If using RCurl you get an SSL error on the getURL() function, then set these options before calling getURL(). This will set the CurlSSL settings globally.
The extended code:
install.packages("RCurl")
library(RCurl)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
URL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
x <- getURL(URL)
Worked for me on Windows 7 64-bit using R 3.1.0!
Offering the curl package as an alternative that I found to be reliable when extracting large files from an online database. In a recent project, I had to download 120 files from an online database and found that it halved the transfer times and was much more reliable than download.file.
#install.packages("curl")
library(curl)
#install.packages("RCurl")
library(RCurl)
ptm <- proc.time()
URL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
x <- getURL(URL)
proc.time() - ptm
ptm
ptm1 <- proc.time()
curl_download(url =URL ,destfile="TEST.CSV",quiet=FALSE, mode="wb")
proc.time() - ptm1
ptm1
ptm2 <- proc.time()
y = download.file(URL, destfile = "./data/data.csv", method="curl")
proc.time() - ptm2
ptm2
In this case, rough timing on your URL showed no consistent difference in transfer times. In my application, using curl_download in a script to select and download 120 files from a website decreased my transfer times from 2000 seconds per file to 1000 seconds, and cut the failure rate from roughly 50% to 2 failures in 120 files. The script is posted in my answer to a question I asked earlier.
Try the following for heavy files:
library(data.table)
URL <- "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
x <- fread(URL)
Exit status 127 means "command not found".
In your case, the curl executable could not be found, i.e., curl is not installed (or not on your PATH).
You need to install or reinstall curl. Get the latest version for your OS from http://curl.haxx.se/download.html
Close RStudio before installation.
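A quick way to confirm the fix from inside R is Sys.which, which returns an empty string when an executable is not on your PATH:
Sys.which("curl")
# "" means curl is still not visible to R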
Had exactly the same problem as UseR (original question), I'm also using windows 7. I tried all proposed solutions and they didn't work.
I resolved the problem doing as follows:
Using RStudio instead of R console.
Updating R (from 3.1.0 to 3.1.1) so that the RCurl library runs OK on it. (I'm now using R 3.1.1 32-bit, although my system is 64-bit.)
Typing the URL address as https (secure connection) and with forward slashes / instead of backslashes \\.
Setting method = "auto".
It works for me now. You should see the message:
Content type 'text/csv; charset=utf-8' length 9294 bytes
opened URL
downloaded 9294 bytes
You can set a global option and try:
options('download.file.method' = 'curl')
download.file(URL, destfile = "./data/data.csv", method = "auto")
For background on this issue, see:
https://stat.ethz.ch/pipermail/bioconductor/2011-February/037723.html
Downloading files through the httr package also works:
URL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
httr::GET(URL,
          httr::write_disk(path = basename(URL),
                           overwrite = TRUE))
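As a follow-up check (stop_for_status is part of httr), you can make the download fail loudly when the server returns an error code:
resp <- httr::GET(URL, httr::write_disk(path = basename(URL), overwrite = TRUE))
httr::stop_for_status(resp)  # raises an R error on HTTP 4xx/5xx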
