Problems downloading a PDF file using R

I would like to download a PDF file from the internet and save it to the local hard drive. After the download, the output PDF has lots of empty pages. What can I do to fix it?
Example:
require(XML)
url <- ('http://cran.r-project.org/doc/manuals/R-intro.pdf')
download.file(url, 'introductionToR.pdf')
Thanks in advance.

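Try binary ("wb") mode, like this:
download.file(url, 'introductionToR.pdf', mode="wb")
It works for me that way. On Windows, the default text mode rewrites line endings, which corrupts binary files such as PDFs.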
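To convince yourself that binary mode copies a file byte for byte, you can run download.file against a local file:// URL and compare the bytes. This is only a sketch: the payload bytes and temp file names are made up for the illustration, and the file:// form shown assumes a Unix-like path (on Windows the URL prefix differs).

```r
# Made-up payload containing bytes that text-mode translation could alter
# (LF, CR) plus a NUL and 0xff, standing in for a real PDF.
payload <- as.raw(c(0x25, 0x50, 0x44, 0x46, 0x0a, 0x00, 0x0d, 0x0a, 0xff))
src <- tempfile(fileext = ".bin")
writeBin(payload, src)

# Build a file:// URL for the local source (assumes an absolute Unix-like path).
src_url <- paste0("file://", normalizePath(src, winslash = "/"))

# Copy it with download.file in binary mode.
dest <- tempfile(fileext = ".pdf")
download.file(src_url, dest, mode = "wb", quiet = TRUE)

# Read the copy back and compare it with what was written.
copied <- readBin(dest, what = "raw", n = file.size(dest))
same_bytes <- identical(copied, payload)
```

If same_bytes is FALSE for a real download, the transfer mode is the first thing to check.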

You can download PDFs and extract their tables as data frames using the tabulizer package:
https://ropensci.org/tutorials/tabulizer_tutorial.html
install.packages("ghit")
# on 64-bit Windows
ghit::install_github(c("ropenscilabs/tabulizerjars", "ropenscilabs/tabulizer"), INSTALL_opts = "--no-multiarch")
# elsewhere
ghit::install_github(c("ropenscilabs/tabulizerjars", "ropenscilabs/tabulizer"))
library(tabulizer)
f2 <- "https://github.com/leeper/tabulizer/raw/master/inst/examples/data.pdf"
extract_tables(f2, pages = 1, method = "data.frame")

Related

Passed a filename that is NOT a string of characters! (RMarkdown)

I'm accessing ncdf files directly from a website in my RMarkdown document.
When I try to read the file using the nc_open function as in the code below, I get the error 'Passed a filename that is NOT a string of characters!'
Any idea how I can solve this?
ps: I even tried uncompressing the files with the gzcon function but the result is the same when I try to read the data.
Thanks for your help!
Kami
library(httr)
library(ncdf4)
nc<-GET("https://crudata.uea.ac.uk/cru/data/hrg/cru_ts_4.05/cruts.2103051243.v4.05/pre/cru_ts4.05.2011.2020.pre.dat.nc.gz")
cru_nc<-nc_open(nc)
OK, here is the full answer:
library(httr)
library(ncdf4)
library(R.utils)
url <- "https://crudata.uea.ac.uk/cru/data/hrg/cru_ts_4.05/cruts.2103051243.v4.05/pre/cru_ts4.05.2011.2020.pre.dat.nc.gz"
filename <- "/tmp/file.nc.gz"
# Download the file and store it as a temp file
download.file(url, filename, mode = "wb")
# Unzip the temp file
gunzip(filename)
# The unzipped filename drops the .gz
unzip_filename <- "/tmp/file.nc"
# You can now open the unzipped file with its **filename** rather than the object
cru_nc<-nc_open(unzip_filename)
Is this a mode="w" vs. mode="wb" issue? I've had this with files before, though I have no experience with ncdf4.
I'm not sure whether you can pass mode="wb" to GET, but does
download.file(yourUrl, destfile, mode="wb")
work or help?
Edit:
Ah, the other thing is that you are storing the response as an object (nc), but nc_open wants the name of a file.
I think you need to save the file locally (unless nc_open can take the URL directly) and then open it, possibly after unzipping.
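The "save locally, unzip, then pass a path" idea can also be sketched with base R alone, without R.utils. The helper name gunzip_base and the tiny made-up payload below are assumptions for illustration. The point is that the function returns a character path, which is the kind of argument nc_open expects.

```r
# Hypothetical helper: decompress a .gz file with base R connections and
# return the path of the decompressed file as a character string.
gunzip_base <- function(src, dest = sub("\\.gz$", "", src)) {
  con_in <- gzfile(src, open = "rb")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(dest, open = "wb")
  on.exit(close(con_out), add = TRUE)
  repeat {
    # Copy in chunks so large files are not held in memory at once.
    chunk <- readBin(con_in, what = "raw", n = 65536L)
    if (length(chunk) == 0L) break
    writeBin(chunk, con_out)
  }
  dest
}

# Offline demonstration with made-up bytes standing in for a real download.
payload <- as.raw(c(0x43, 0x44, 0x46, 0x01, 0x0a, 0x0d, 0x00, 0xff))
gz_path <- tempfile(fileext = ".nc.gz")
con <- gzfile(gz_path, open = "wb")
writeBin(payload, con)
close(con)

nc_path <- gunzip_base(gz_path)
restored <- readBin(nc_path, what = "raw", n = file.size(nc_path))
```

With a real file, nc_path is what you would hand to nc_open, rather than the response object returned by GET.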

How do I read an Excel file through a URL in RStudio? It's https

I have a secure URL which provides data in Excel format. How do I read it in RStudio?
Please mention the necessary package and functions. I have tried read.xls(),
read_xlsx, read.URL and some more. Nothing seems to work.
You can do it in two steps. First, download the file with something like download.file (using mode = "wb" so the binary file is not corrupted on Windows), then read it with readxl::read_excel:
download.file("https://file-examples.com/wp-content/uploads/2017/02/file_example_XLS_10.xls", destfile = "/tmp/file.xls", mode = "wb")
readxl::read_excel("/tmp/file.xls")
library(readxl)
library(httr)
url<-'https://......xls'
GET(url, write_disk(TF <- tempfile(fileext = ".xls")))
read_excel(TF)
Have you tried importing it as a .csv dataset into RStudio? Might be worth a try!:)

Downloading file using R with default name [duplicate]

I need to download a file, save it in a folder while keeping the original filename from the website.
url <- "http://www.seg-social.es/prdi00/idcplg?IdcService=GET_FILE&dID=187112&dDocName=197533&allowInterrupt=1"
From a web browser, if you click on that link, you get to download an excel file with this filename:
AfiliadosMuni-02-2015.xlsx
I know I can easily download it with the command download.file in R like this:
download.file(url, "test.xlsx", method = "curl")
But what I really need for my script is to download it keeping the original filename intact. I also know I can do this with curl from my console like this.
curl -O -J $"http://www.seg-social.es/prdi00/idcplg?IdcService=GET_FILE&dID=187112&dDocName=197533&allowInterrupt=1"
But, again, I need this within an R script. Is there a way similar to the one above but in R? I have looked into the RCurl package but I couldn't find a solution.
You could always do something like:
library(httr)
library(stringr)
# alternate way to "download.file"
fil <- GET("http://www.seg-social.es/prdi00/idcplg?IdcService=GET_FILE&dID=187112&dDocName=197533&allowInterrupt=1",
write_disk("tmp.fil"))
# get the name the site suggests it should be
fname <- str_match(headers(fil)$`content-disposition`, "\"(.*)\"")[2]
# rename
file.rename("tmp.fil", fname)
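I think basename() would be the simplest option when the URL path ends in the filename: https://www.rdocumentation.org/packages/base/versions/3.4.3/topics/basename
Note that for the URL in the question, basename(url) returns the script name plus the query string rather than AfiliadosMuni-02-2015.xlsx.
e.g.
download.file(url, basename(url))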
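The header-parsing step can be sketched with base R alone. This is a rough illustration, not a full parser: filename_from_disposition is a hypothetical helper, the header strings are assumed examples of what a server might send, and the extended filename*= form is ignored.

```r
# Hypothetical helper: pull the suggested filename out of a
# Content-Disposition header value, falling back to a default name.
filename_from_disposition <- function(header, fallback = "download.bin") {
  if (is.null(header) || length(header) != 1L || is.na(header)) {
    return(fallback)
  }
  # Capture the value after filename=, with or without surrounding quotes.
  m <- regmatches(
    header,
    regexec('filename="?([^";]+)"?', header, ignore.case = TRUE)
  )[[1]]
  # basename() drops any directory part a server might include.
  if (length(m) < 2L) fallback else basename(m[2])
}

# Assumed example header values.
quoted   <- filename_from_disposition('attachment; filename="AfiliadosMuni-02-2015.xlsx"')
unquoted <- filename_from_disposition("attachment; filename=report.csv")
missing  <- filename_from_disposition(NULL)
```

The result can then be passed to file.rename after the download, in place of the str_match call.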

trying to use fread() on .csv file but getting internal error "ch>eof"

I am getting an error from fread:
Internal error: ch>eof when detecting eol
when trying to read a csv file downloaded from an https server, using R 3.2.0. I found something related on Github, https://github.com/Rdatatable/data.table/blob/master/src/fread.c, but don't know how I could use this, if at all. Thanks for any help.
Added info: the data was downloaded from here:
fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
then I used
download.file(fileURL, "Idaho2006.csv", method = "internal")
The problem is that download.file doesn't work with https URLs when method = "internal", unless you're on Windows and set an option. Since fread uses download.file when you pass it a URL rather than a local file, it fails too. You have to download the file yourself and then read it from the local copy.
If you're on Linux, or already have wget or curl installed, use method = "wget" or method = "curl" instead.
If you're on Windows, don't have either, and don't want to install them, call setInternet2(use = TRUE) before download.file:
http://www.inside-r.org/r-doc/utils/setInternet2
For example:
fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
tempf <- tempfile()
download.file(fileURL, tempf, method = "curl")
DT <- fread(tempf)
unlink(tempf)
Or
fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
tempf <- tempfile()
setInternet2(use = TRUE)
download.file(fileURL, tempf)
DT <- fread(tempf)
unlink(tempf)
fread() now uses the curl package for downloading files, and this seems to work fine at the moment:
require(data.table) # v1.9.6+
fread(fileURL, showProgress = FALSE)
In my experience, the easiest way to fix this problem is to just remove the s from https. Also remove the method argument; you don't need it. My OS is Windows, and I have tried the following code and it works.
fileURL <- "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
download.file(fileURL, "Idaho2006.csv")

How to download an .xlsx file from a dropbox (https:) location

I'm trying to adopt the Reproducible Research paradigm while meeting halfway the people who prefer looking at Excel rather than text data files, by using Dropbox to host Excel files which I can then access using the xlsx package.
Rather like downloading and unpacking a zipped file I assumed something like the following would work:
# Prerequisites
require("xlsx")
require("ggplot2")
require("repmis")
require("devtools")
require("RCurl")
# Downloading data from Dropbox location
link <- paste0(
"https://www.dropbox.com/s/",
"{THE SHA-1 KEY}",
"{THE FILE NAME}"
)
url <- getURL(link)
temp <- tempfile()
download.file(url, temp)
However, I get Error in download.file(url, temp) : unsupported URL scheme
Is there an alternative to download.file that will accept this URL scheme?
Thanks,
Jon
You have the wrong URL - the one you are using just goes to the landing page. I think the actual download URL is different, I managed to get it sort of working using the below.
I actually don't think you need to use RCurl or the getURL() function, and I think you were leaving out some relatively important /'s in your previous formulation.
Try the following:
link <- paste("https://dl.dropboxusercontent.com/s",
"{THE SHA-1 KEY}",
"{THE FILE NAME}",
sep="/")
download.file(url=link,destfile="your.destination.xlsx",mode="wb")
closeAllConnections()
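One way to notice that a download returned a landing page rather than the workbook is to look at the first bytes of the saved file. A minimal sketch, assuming the file should be an .xlsx (a ZIP container, so it starts with the bytes "PK"); looks_like_xlsx is a hypothetical helper name and the two temp files are made-up stand-ins for real downloads.

```r
# Hypothetical helper: TRUE if the file starts with the ZIP signature "PK",
# which every .xlsx file does; an HTML page will not.
looks_like_xlsx <- function(path) {
  magic <- readBin(path, what = "raw", n = 2L)
  identical(magic, as.raw(c(0x50, 0x4b)))
}

# Stand-in for a real workbook: just the first bytes of a ZIP file.
good <- tempfile(fileext = ".xlsx")
writeBin(as.raw(c(0x50, 0x4b, 0x03, 0x04, 0x14, 0x00)), good)

# Stand-in for a landing page saved under an .xlsx name.
bad <- tempfile(fileext = ".xlsx")
writeLines("<!DOCTYPE html><html><body>Landing page</body></html>", bad)

xlsx_check <- c(good = looks_like_xlsx(good), bad = looks_like_xlsx(bad))
```

This only checks the container signature, so it will not catch a file that was corrupted further in, for example by a text-mode transfer.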
UPDATE:
I just realised there is a source_XlsxData function in the repmis package, which in theory should do the job perfectly.
Also the function below works some of the time but not others, and appears to get stuck at the GET line. So, a better solution would be very welcome.
I decided to try taking a step back and figure out how to download a raw file from a secure (https) url. I adapted (butchered?) the source_url function in devtools to produce the following:
download_file_url <- function(url, outfile, ..., sha1 = NULL) {
  # Only httr is used here; `...` and `sha1` are kept from the original
  # signature but are not used.
  require(httr)
  stopifnot(is.character(url), length(url) == 1)
  # Fetch the file into memory and stop on an HTTP error
  request <- GET(url)
  stop_for_status(request)
  # Write the raw response body to disk in binary mode
  filetag <- file(outfile, "wb")
  on.exit(close(filetag))
  writeBin(content(request, as = "raw"), filetag)
}
This seems to work for producing local versions of binary files, Excel included. Nicer, neater, smarter improvements to this are gratefully received.
