Downloading multiple files using purrr - r

I am trying to download all of the Excel files at: https://www.grants.gov.au/reports/gaweeklyexport
Using the code below I get errors similar to the following for each link (77 in total):
[[1]]$error
<simpleError in download.file(.x, .y, mode = "wb"): scheme not supported in URL '/Reports/GaWeeklyExportDownload?GaWeeklyExportUuid=0db183a2-11c6-42f8-bf52-379aafe0d21b'>
I get this error when trying to iterate over the full list, but when I call download.file on an individual list item it works fine.
I would be grateful if someone could tell me what I have done wrong or suggest a better way of doing it.
The code that produces the error:
library(tidyverse)
library(rvest)
# Reading links to the Excel files to be downloaded
url <- "https://www.grants.gov.au/reports/gaweeklyexport"
webpage <- read_html(url)
# The list of links to the Excel files
links <- html_attr(html_nodes(webpage, '.u'), "href")
# Creating names for the files to supply to the download.file function
wb_names = str_c(1:77, ".xlsx")
# Defining a function that uses purrr's safely so the iteration doesn't fail if there is a dead link
safe_download <- safely(~ download.file(.x , .y, mode = "wb"))
# Iterating over the links and file names with the function returns an error
map2(links, wb_names, safe_download)

The href attributes are relative paths, so you need to prepend 'https://www.grants.gov.au/' to each of them to get the absolute URL that can be used to download the file.
library(rvest)
library(purrr)
url <- "https://www.grants.gov.au/reports/gaweeklyexport"
webpage <- read_html(url)
# The list of links to the Excel files
links <- paste0('https://www.grants.gov.au/', html_attr(html_nodes(webpage, '.u'), "href"))
safe_download <- safely(~ download.file(.x , .y, mode = "wb"))
# Creating names for the files to supply to the download.file function
wb_names = paste0(1:77, ".xlsx")
map2(links, wb_names, safe_download)
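safely() returns, for every link, a list with a result and an error element, so it is easy to check afterwards which downloads failed. A minimal sketch of that check (my addition, not part of the original answer), assuming the map2() call above has been stored:
results <- map2(links, wb_names, safe_download)
# Pull out the error element of each result and keep the links whose error is not NULL
errors <- transpose(results)$error
failed <- links[!map_lgl(errors, is.null)]
failed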

Related

Parsing issue, unexpected character when loading a folder

I am using this answer to load in a folder of Excel Files:
# Get the list of files
#----------------------------#
folder <- "path/to/files"
fileList <- dir(folder, recursive=TRUE) # grep through these, if you are not loading them all
# use platform appropriate separator
files <- paste(folder, fileList, sep=.Platform$file.sep)
So far, so good.
# Load them in
#----------------------------#
# Method 1:
invisible(sapply(files, source, local=TRUE))
#-- OR --#
# Method 2:
sapply(files, function(f) eval(parse(text=f)))
But the source function (Method 1) gives me the error:
Error in source("C:/Users/Username/filename.xlsx") :
C:/Users/filename :1:3: unexpected input
1: PK
^
For Method 2 I get the error:
Error in parse(text = f) : <text>:1:3: unexpected '/'
1: C:/
^
EDIT: I tried circumventing the issue by setting the working directory to the directory of the folder, but that did not help.
Any ideas why this happens?
EDIT 2: It works when doing the following:
How can I read multiple (excel) files into R?
setwd("...")
library(readxl)
file.list <- list.files(pattern='*.xlsx')
df.list <- lapply(file.list, read_excel)
Just to provide a proper answer outside of the comment section...
If your goal is to read many Excel files, you shouldn't use source.
source is dedicated to running external R code.
If you need to read many Excel files, you can use the following code together with one of these libraries: readxl, openxlsx, or tidyxl (with the support of unpivotr).
filelist <- dir(folder, recursive = TRUE, full.names = TRUE, pattern = ".xlsx$|.xls$", ignore.case = TRUE)
l_df <- lapply(filelist, readxl::read_excel)
Note that we are using dir to list the full paths (full.names = TRUE) of all the files that end with .xlsx or .xls (pattern = ".xlsx$|.xls$"), in upper or lower case (ignore.case = TRUE), in the folder folder and all its subfolders (recursive = TRUE).
readxl is integrated with the tidyverse. It is pretty easy to use and is most likely what you're looking for.
Personally, I advise using openxlsx if you need to write (rather than read) customized Excel files with many specific features.
tidyxl is the best package I've seen for reading Excel files, but it can be rather complicated to use. However, it is very careful about preserving types, and with the support of unpivotr it lets you handle complicated Excel structures, for example when you find multiple headers and multiple left index columns.
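If the files share the same column structure, the resulting list of data frames can be stacked into a single one. A minimal sketch (my addition, assuming dplyr is available and that the sheets are compatible):
library(dplyr)
# Name each element after its file so the source file is kept as a column when stacking
names(l_df) <- basename(filelist)
combined <- bind_rows(l_df, .id = "file")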

how to download pdf file with R from web (encode issue)

I am trying to download a PDF file from a website using R. When I tried to use the function browseURL, it only worked with the argument encodeIfNeeded = T. As a result, if I pass the same URL to the function download.file, it returns "cannot open destfile 'downloaded/teste.pdf', reason 'No such file or directory'", i.e., it can't find the correct URL.
How do I correct the encoding so that I can download the file programmatically?
I need to automate this, because there are more than a thousand files to download.
Here's a minimum reproducible code:
library(tidyverse)
library(rvest)
url <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html"
webpage <- read_html(url)
# scraping hyperlinks
links_decisoes <- html_nodes(webpage,".borderTD a") %>%
html_attr("href")
# creating full/correct url
full_links <- paste("http://www.ouvidoriageral.sp.gov.br/", links_decisoes, sep="" )
# browseURL only works with encodeIfNeeded = T
browseURL(full_links[1], encodeIfNeeded = T,
browser = "C://Program Files//Mozilla Firefox//firefox.exe")
# returns an error
download.file(full_links[1], "downloaded/teste.pdf")
There are a couple of problems here. Firstly, the links to some of the files are not properly formatted as URLs; they contain spaces and other special characters. To convert them you must use url_escape(), which should be available to you since loading rvest also loads xml2, the package that contains url_escape().
Secondly, the path you are saving to is relative to your R home directory, but you are not telling R this. You either need the full path, like "C://Users/Manoel/Documents/downloaded/testes.pdf", or a relative path such as path.expand("~/downloaded/testes.pdf").
This code should do what you need:
library(tidyverse)
library(rvest)
# scraping hyperlinks
full_links <- "http://www.ouvidoriageral.sp.gov.br/decisoesLAI.html" %>%
read_html() %>%
html_nodes(".borderTD a") %>%
html_attr("href") %>%
url_escape() %>%
{paste0("http://www.ouvidoriageral.sp.gov.br/", .)}
# Looks at page in firefox
browseURL(full_links[1], encodeIfNeeded = T, browser = "firefox.exe")
# Saves pdf to "downloaded" folder if it exists
download.file(full_links[1], path.expand("~/downloaded/teste.pdf"))
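To extend this to all of the scraped links (the question mentions more than a thousand files), one option is to wrap download.file() in purrr::safely() and iterate with map2(). This is a hedged sketch, not part of the original answer; the numbered file names are just illustrative:
library(purrr)
# One numbered destination per link, inside the "downloaded" folder (which must exist)
dest <- path.expand(paste0("~/downloaded/decisao_", seq_along(full_links), ".pdf"))
safe_download <- safely(download.file)
# mode = "wb" avoids corrupting the binary pdf files; failures are captured instead of stopping the loop
results <- map2(full_links, dest, safe_download, mode = "wb")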

R issues vs excel

I am working on a project where I have multiple Excel files, each containing multiple worksheets. I have to get the data from one of the worksheets (say sheet = 6) and then store all of these data in a new .xls or .csv file.
I am facing an issue while trying to read the data from the files and store it in a list. I get the following error:
Error: `path` does not exist: ‘BillingReport___Gurgaon-Apr-2019.xlsx’
I am trying the map_dfr function to get the data.
library(purrr)
library(readxl)
library(dplyr)
library(rio)
library(XLConnect)
library(tidyverse)
setwd ="F:/Capstone/Billing Reports final/"
#Set path of Billing source folder
billingptah <- "F:/Capstone/Billing Reports final/"
#Set path of destination folder
csvexportpath <- "F:/Capstone/Billing_data/billing_data.csv"
#get the names of the files to be loaded
files_to_load <- list.files(path = billingptah)
files_to_load
#Load all the data into one file
billing_data <- map_dfr(files_to_load, function(x)
  map_dfr(excel_sheets(x), function(y)
    read_excel(path = x, sheet = 6, col_types = "text") %>%
      mutate(sheet = 6)) %>%
    mutate(filename = x))
The following is the error message:
Error: `path` does not exist:
‘BillingReport___Gurgaon-Apr-2019.xlsx’
It is all about the difference between relative and absolute paths. You're telling R to load a file located in your current working directory named ‘BillingReport___Gurgaon-Apr-2019.xlsx’. You need to prefix each file name with the path to its directory. Try this after building files_to_load:
files_to_load <- paste0(billingptah, files_to_load)
This tells R to access the files listed in files_to_load, located in the billingptah directory.
Edit
Let me just point out some useful links:
https://www.reed.edu/data-at-reed/resources/R/reading_and_writing.html
And for best practices: https://stat.ethz.ch/R-manual/R-devel/library/base/html/file.path.html
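As an alternative to pasting the directory onto each name yourself, list.files() can return the paths with the directory already prepended, and file.path() (from the link above) builds platform-safe paths. A minimal sketch of the same fix under that approach (my addition, not part of the original answer):
# full.names = TRUE prepends the directory, so no manual paste0() is needed
files_to_load <- list.files(path = billingptah, pattern = "\\.xlsx$", full.names = TRUE)
# equivalent construction with file.path()
# files_to_load <- file.path(billingptah, list.files(path = billingptah, pattern = "\\.xlsx$"))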

Reading pdf with TM package

I am trying to read PDF files with the tm package. I have succeeded in most of my attempts, except one. I have several folders with hundreds of documents each, and I have read all of them but one folder. The problem is that the PDFs in that specific folder have a sequence of images at the bottom of the first page that prevents me from reading them. I get the following error:
Error in strptime(d, fmt) : input string is too long
If I remove the first page, I manage to read them. I could do that without much loss of relevant information, but it is too much work.
I tried with xpdf and Ghostscript, but both give me the same error.
My code is as follows:
library(rvest)
library(tm)
url<-paste0("http://www.tjrj.jus.br/search?q=acidente+de+transito+crianca+atropelamento&btnG=Pesquisar&processType=cnj&site=juris&client=juris&output=xml_no_dtd&proxystylesheet=juris&entqrm=0&oe=UTF-8&ie=UTF-8&ud=1&filter=0&getfields=*&partialfields=(ctd:1)&exclude_apps=1&ulang=en&lr=lang_pt&sort=date:D:S:d1&as_q=+&access=p&entqr=3&start=",seq(0,462,10))
css<-sprintf(".margin-top-10:nth-child(%.d) .outros .featured",1:10)
for (j in 1:1){ # There 47 pages, but I only put one here
for (i in 1:10){ # there are 10 files per page.
a <- read_html(url[j]) %>% html_node(css = css[i]) %>%
  html_attr("href")
download.file(a,paste0("doc",j,i,".pdf"))
}
}
files <- list.files(pattern = "pdf$")
Rpdf <- readPDF(control = list(text = "-layout"))
docs <- Corpus(URISource(files,encoding="UTF-8"),readerControl = list(reader = Rpdf,language="portuguese"))
Does someone have a suggestion? I use a Mac.
Late answer, but I recently discovered that with the current version of tm (0.7-4), readPDF uses pdftools as the default engine for reading PDFs.
library(tm)
directory <- getwd() # change this to directory where pdf-files are located
# read the pdfs with readPDF, default engine used is pdftools see ?readPDF for more info
my_corpus <- VCorpus(DirSource(directory, pattern = ".pdf"),
readerControl = list(reader = readPDF))
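If the problematic folder still fails at the corpus step (the strptime error suggests a date field in the PDF metadata is being parsed), a minimal alternative sketch (my addition, assuming only the text is needed rather than a full tm corpus) reads the text directly with pdftools:
library(pdftools)
files <- list.files(pattern = "\\.pdf$")
# pdf_text() returns one character vector per document, with one element per page
texts <- lapply(files, pdf_text)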

Read list of file names from web into R

I am trying to read a lot of csv files into R from a website. There are multiple years of daily (business days only) files. All of the files have the same data structure. I can successfully read one file using the following logic:
# enter user credentials
user <- "JohnDoe"
password <- "SecretPassword"
credentials <- paste(user,":",password,"@",sep="")
web.site <- "downloads.theice.com/Settlement_Reports_CSV/Power/"
# construct path to data
path <- paste("https://", credentials, web.site, sep="")
# read data for 4/10/2013
file <- "icecleared_power_2013_04_10"
fname <- paste(path,file,".dat",sep="")
df <- read.csv(fname,header=TRUE, sep="|",as.is=TRUE)
However, I'm looking for tips on how to read all the files in the directory at once. I suppose I could generate a sequence of dates, construct the file names above in a loop, and use rbind to append each file, but that seems cumbersome. Plus there will be issues when attempting to read weekends and holidays, where there are no files.
The images below show what the list of files looks like in the web browser (screenshots omitted here).
Is there a way to first scan the path (from above) to get a list of all the file names in the directory that meet a certain criterion (i.e. start with "icecleared_power_", since there are also files in that location with a different starting name that I do not want to read in), then loop read.csv through that list and use rbind to append?
Any guidance would be greatly appreciated.
I would first try to just scrape the links to the relevant data files and use the resulting information to construct the full download path that includes user logins and so on. As others have suggested, lapply would be convenient for batch downloading.
Here's an easy way to extract the URLs. Obviously, modify the example to suit your actual scenario.
Here, we're going to use the XML package to identify all the links available at the CRAN archives for the Amelia package (http://cran.r-project.org/src/contrib/Archive/Amelia/).
> library(XML)
> url <- "http://cran.r-project.org/src/contrib/Archive/Amelia/"
> doc <- htmlParse(url)
> links <- xpathSApply(doc, "//a/@href")
> free(doc)
> links
href href href
"?C=N;O=D" "?C=M;O=A" "?C=S;O=A"
href href href
"?C=D;O=A" "/src/contrib/Archive/" "Amelia_1.1-23.tar.gz"
href href href
"Amelia_1.1-29.tar.gz" "Amelia_1.1-30.tar.gz" "Amelia_1.1-32.tar.gz"
href href href
"Amelia_1.1-33.tar.gz" "Amelia_1.2-0.tar.gz" "Amelia_1.2-1.tar.gz"
href href href
"Amelia_1.2-2.tar.gz" "Amelia_1.2-9.tar.gz" "Amelia_1.2-12.tar.gz"
href href href
"Amelia_1.2-13.tar.gz" "Amelia_1.2-14.tar.gz" "Amelia_1.2-15.tar.gz"
href href href
"Amelia_1.2-16.tar.gz" "Amelia_1.2-17.tar.gz" "Amelia_1.2-18.tar.gz"
href href href
"Amelia_1.5-4.tar.gz" "Amelia_1.5-5.tar.gz" "Amelia_1.6.1.tar.gz"
href href href
"Amelia_1.6.3.tar.gz" "Amelia_1.6.4.tar.gz" "Amelia_1.7.tar.gz"
For the sake of demonstration, imagine that, ultimately, we only want the links for the 1.2 versions of the package.
> wanted <- links[grepl("Amelia_1\\.2.*", links)]
> wanted
href href href
"Amelia_1.2-0.tar.gz" "Amelia_1.2-1.tar.gz" "Amelia_1.2-2.tar.gz"
href href href
"Amelia_1.2-9.tar.gz" "Amelia_1.2-12.tar.gz" "Amelia_1.2-13.tar.gz"
href href href
"Amelia_1.2-14.tar.gz" "Amelia_1.2-15.tar.gz" "Amelia_1.2-16.tar.gz"
href href
"Amelia_1.2-17.tar.gz" "Amelia_1.2-18.tar.gz"
You can now use that vector as follows:
wanted <- links[grepl("Amelia_1\\.2.*", links)]
GetMe <- paste(url, wanted, sep = "")
lapply(seq_along(GetMe),
function(x) download.file(GetMe[x], wanted[x], mode = "wb"))
Update (to clarify your question in comments)
The last step in the example above downloads the specified files to your current working directory (use getwd() to verify where that is). If, instead, you know for sure that read.csv works on the data, you can also try to modify your anonymous function to read the files directly:
lapply(seq_along(GetMe),
function(x) read.csv(GetMe[x], header = TRUE, sep = "|", as.is = TRUE))
However, I think a safer approach might be to download all the files into a single directory first, and then use read.delim or read.csv or whatever works to read in the data, similar to what was suggested by @Andreas. I say safer because it gives you more flexibility in case files aren't fully downloaded and so on. In that case, instead of having to redownload everything, you would only need to download the files which were not fully downloaded.
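A minimal sketch of that two-step approach (my addition, reusing the GetMe and wanted vectors from the example above and assuming the downloaded files parse with read.csv(sep = "|") as in the question):
# 1. Download everything into a dedicated directory
dir.create("downloads", showWarnings = FALSE)
dest <- file.path("downloads", wanted)
Map(function(u, d) download.file(u, d, mode = "wb"), GetMe, dest)
# 2. Read the downloaded files and stack them into one data frame
dfs <- lapply(dest, read.csv, header = TRUE, sep = "|", as.is = TRUE)
all_data <- do.call(rbind, dfs)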
@MikeTP, if all the reports start with "icecleared_power_" followed by a date that is a business day, the timeDate package offers an easy way to create a vector of business dates, like so:
require(timeDate)
tSeq <- timeSequence("2012-01-01","2012-12-31") # vector of days
tBiz <- tSeq[isBizday(tSeq)] # vector of business days
and
paste0("icecleared_power_",as.character.Date(tBiz))
gives you the concatenated file names.
If the web site follows a different logic regarding the naming of files, we need more information, as Ananda Mahto observed.
Keep in mind that when you create a date vector with timeDate, you can get much more sophisticated than my simple example. You can take into account holiday schedules, stock exchange calendars, and so on.
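Tying this together with the path construction from the question, a hedged sketch (my addition; the underscore date format is inferred from the file name in the question, and try() is used to skip days with no file):
# Build one candidate file name per business day, e.g. "icecleared_power_2012_01_03.dat"
dates <- gsub("-", "_", substr(as.character(tBiz), 1, 10))
fnames <- paste0(path, "icecleared_power_", dates, ".dat")
# try() keeps the loop going when a file does not exist for a given day
dfs <- lapply(fnames, function(f) try(read.csv(f, header = TRUE, sep = "|", as.is = TRUE), silent = TRUE))
dfs <- dfs[!sapply(dfs, inherits, what = "try-error")]
all_data <- do.call(rbind, dfs)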
You can try using the command "download.file".
### set up the path and destination
path <- "url where file is located"
dest <- "where on your hard disk you want the file saved"
### Ask R to try really hard to download your ".csv"
try(download.file(path, dest))
The trick to this is going to be figuring out how the URL or path changes systematically between files. Often, web pages are built so that the URLs are systematic. In that case, you could create a vector or data frame of URLs to iterate over inside an apply function.
All of this can be sandwiched inside an lapply call. The "data" object is simply whatever we are iterating over: it could be a vector of URLs, or a data frame of year and month observations which could then be used to create URLs within the lapply function.
### "dl" will apply a function to every element in our vector "data"
# It will also help keep track of files which have no download data
dl <- lapply(data, function(x) {
path <- 'url'
dest <- './data_intermediate/...'
try(download.file(path, dest))
})
### Assign element names to your list "dl"
names(dl) <- unique(data$name)
index <- sapply(dl, is.null)
### Figure out which downloads returned nothing
no.download <- names(dl)[index]
You can then use "list.files()" to merge all data together, assuming they belong in one data.frame
### Create a list of files you want to merge together
files <- list.files()
### Create a list of data.frames by reading each file into memory
data <- lapply(files, read.csv)
### Stack data together
data <- do.call(rbind, data)
Sometimes you will notice that a file has been corrupted after downloading. In this case, pay attention to the mode option of the download.file() command: set mode = "wb" when the file is stored in a binary format.
Another solution
for (i in seq_along(stations)) {
  URL <- ""  # listing page for this station (left blank in the original)
  # Scrape the link text of every anchor on the page, then keep the entries matching the station
  files <- rvest::read_html(URL) %>% html_nodes("a") %>% html_text(trim = TRUE)
  files <- grep(stations[i], files, ignore.case = TRUE, value = TRUE)
  destfile <- paste0("C:/Users/...", files)
  # Download each matching file (assuming the link text is the file name relative to URL)
  Map(function(f, d) download.file(paste0(URL, f), d, mode = "wb"), files, destfile)
}
