I have a question about downloading files. I know how to download a file with the download.file function. I need to download multiple files from a particular site, each file corresponding to a different date, and I have a series of dates from which I can build each URL. I know for a fact that for some dates the files are missing on the website, and when that happens my code stops at that point. I then have to manually reset the date index (increment it by 1) and re-run the code. Since I have to download more than 1500 files, I was wondering if I can somehow capture the 'absence of the file' so that, instead of stopping, the code continues with the next date in the array.
Below is the dput of a part of the date array:
dput(head(fnames,10))
c("20060102.trd", "20060103.trd", "20060104.trd", "20060105.trd",
"20060106.trd", "20060109.trd", "20060110.trd", "20060112.trd",
"20060113.trd", "20060116.trd")
The full vector has 1723 dates. Below is the code that I am using:
for (i in 1:length(fnames)) {
  # rearrange YYYYMMDD from the file name into DDMMYYYY for the URL
  file <- paste(substr(fnames[i], 7, 8), substr(fnames[i], 5, 6), substr(fnames[i], 1, 4), sep = "")
  URL <- paste("http://xxxxx_", file, ".zip", sep = "")
  download.file(URL, paste(file, "zip", sep = "."))
  unzip(paste(file, "zip", sep = "."))
}
The program works fine until it encounters a particular date for which the file is missing, and then it stops. Is there a way to capture this, print the missing file name (the variable 'file'), and move on to the next date in the array?
Please help.
I apologize that I have not shared the exact URL. If that makes it difficult to reproduce the issue, please let me know.
Edit: Trying to incorporate @Paul's suggestion.
I worked on a smaller dataset.
dput(testnames) is
c("20120214.trd", "20120215.trd", "20120216.trd", "20120217.trd",
"20120221.trd")
I know that the file corresponding to the date '20120216' is missing from the website. I altered my code to incorporate the tryCatch function; here it is:
tryCatch({
  for (i in 1:length(testnames)) {
    file <- paste(substr(testnames[i], 7, 8), substr(testnames[i], 5, 6), substr(testnames[i], 1, 4), sep = "")
    URL <- paste("http://xxxx_", file, ".zip", sep = "")
    download.file(URL, paste(file, "zip", sep = "."))
    unzip(paste(file, "zip", sep = "."))
  }
},
error = function(e) {
  cat(file, '\n')
  i = i + 1
},
warning = function(w) {
  message('cannot unzip')
  i = i + 1
})
It runs fine for the first two dates and, as expected, throws an error for the third one. I am facing 2 issues:
When I 'exclude' the warning block, it gives me the missing file name (file) as coded in the error block. But when I 'include' the warning block, it only issues the warning and somehow doesn't execute the error block. Why is that?
In either case, the code stops after reading "20120216.trd" and doesn't proceed to the next file, which is what I want it to do. Is incrementing the variable i not sufficient for that purpose?
Please advise.
You can do this using tryCatch. This function will try the operation you feed it and provide you with a way to deal with errors. In your case, for example, an error could simply lead to skipping the file and ignoring the error:
skip_with_message = simpleError('Did not work out')
tryCatch(print(bla), error = function(e) skip_with_message)
# <simpleError: Did not work out>
Notice that the error here is that the bla object does not exist.
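Applied to the loop in the question, one common pattern is to put the tryCatch inside the loop body rather than around the whole loop, so that a failed download only skips that one date. This is just a sketch reusing the question's placeholder URL:
for (i in seq_along(fnames)) {
  # rearrange YYYYMMDD into DDMMYYYY, as in the question
  file <- paste0(substr(fnames[i], 7, 8),
                 substr(fnames[i], 5, 6),
                 substr(fnames[i], 1, 4))
  URL  <- paste0("http://xxxxx_", file, ".zip")   # placeholder URL
  dest <- paste(file, "zip", sep = ".")

  tryCatch({
    download.file(URL, dest)
    unzip(dest)
  },
  error = function(e) {
    # print the missing file name; the loop then moves on to the next date
    cat("skipping", file, "\n")
  },
  warning = function(w) {
    message("problem with ", file, ": ", conditionMessage(w))
  })
}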
I'm creating a simple little time-saver wrapper function that pre-fills some standard file locations, etc., for importing an Excel file using readxl::read_xlsx. It works exactly as expected with the defaults; however, when I try to use it at the console with a different file location I get the following error.
Error in read_space_program(path = "inst/extdata/space_program.xlsx") :
unused argument (path = "inst/extdata/space_program.xlsx")
I've tried adding ... to the argument list to extend the arguments, as suggested on StackOverflow for similar error messages, but it does not fix the problem. This is the code I am running:
read_space_program <-
function(file_location = "inst/extdata/space_program.xlsx",
sheet_name = "Program",
skip_rows = 5, ...) {
readxl::read_xlsx(
path = file_location,
sheet = sheet_name,
col_names = TRUE,
skip = skip_rows
) # first five rows skipped to allow for project information
}
Without uploading the entire .xlsx file, suffice it to say that I use this particular file all the time and it is not the source of the problem. It loads fine with this exact code when I call read_space_program(); however, when I feed the exact same file location to it at the console with read_space_program(file_location = "inst/extdata/space_program.xlsx"), I get the error above. The cause is probably something basic, but I cannot figure it out. Any help is appreciated.
This was caused by an artifact of development in my environment. Cleaning the environment allowed the code to run.
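For anyone hitting the same message, here is a minimal illustration of how a stale copy of the function left in the global environment can produce it (the old signature shown here is hypothetical):
# an earlier, hypothetical version of the wrapper without a file_location argument
read_space_program <- function(sheet_name = "Program") {
  readxl::read_xlsx("inst/extdata/space_program.xlsx", sheet = sheet_name)
}

# while this old definition is the one in the environment, calling it with the
# new argument fails with "unused argument (file_location = ...)":
# read_space_program(file_location = "inst/extdata/space_program.xlsx")

# clearing the environment (or restarting R) and re-sourcing the current
# definition removes the stale copy
rm(list = ls())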
I am brand new to R and I am trying to run some existing code that should clean up an input .csv then save the cleaned data to a different location as a .RData file. This code has run fine for the previous owner.
The code seems to be pulling the .csv and cleaning it just fine. It also looks like the save is running (there are no errors) but there is no output in the specified location. I thought maybe R was having a difficult time finding the location, but it's pulling the input data okay and the destination is just a sub folder.
After a full day of extensive Googling, I can't find anything related to a save just not working.
Example code below:
save(data, file = "C:\\Users\\my_name\\Documents\\Project\\Data.RData", sep="")
Hard to believe you don't see any errors - unless something has switched errors off:
> data = 1:10
> save(data, file="output.RData", sep="")
Error in FUN(X[[i]], ...) : invalid first argument
It's a misleading error: the problem is the third argument, which doesn't do anything. Remove it and it works:
> save(data, file="output.RData")
>
sep is an argument used when writing CSV files, where it separates columns. save writes binary data, which doesn't have rows and columns.
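A short sketch of the distinction (the file names are just examples):
data <- 1:10

# save() writes a binary .RData image and takes no sep argument;
# the object comes back by name with load()
save(data, file = "Data.RData")
load("Data.RData")

# sep belongs to text writers such as write.table(), where it separates columns
write.table(data.frame(x = data), file = "Data.txt", sep = ",", row.names = FALSE)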
I'm trying to scrape a set of news articles using rvest and boilerpipeR. The code works fine most of the time; however, it crashes for some specific values. I searched online high and low and could not find anyone reporting anything similar.
require(rvest)
require(stringr)
require(boilerpipeR)
require(RCurl)   # getURLContent() comes from RCurl

# this is a problematic URL; its duplicates also generate fatal errors
url = "http://viagem.estadao.com.br/noticias/geral,museu-da-mafia-ganha-exposicao-permanente-da-serie-the-breaking-bad,10000018395"
content_html = getURLContent(url)              # HTML source code as a character string
article_text = ArticleExtractor(content_html)  # returns 'NA'
# the next line induces the fatal error
encoded_exit = read_html(content_html, encoding = "UTF-8")
paragraph = html_nodes(encoded_exit, "p")
article_text = html_text(paragraph)
article_text = iconv(article_text, from = "UTF-8", to = "latin1")
This is not the only news piece for which ArticleExtractor() returns 'NA', and the code was built to handle that as a viable result. The whole snippet is inside a tryCatch(), so regular errors should not be able to stop execution.
The main issue is that the entire R session just crashes and has to be reloaded, which prevents me from grabbing data and debugging it.
What could be causing this issue?
And how can I stop it from crashing the entire R session?
I had the same problem.
RScript crashes without any error message (session aborted), whether I use 32-bit or 64-bit.
The solution for me was to look at the URL I was scraping.
If the page at the URL has severe mistakes in its HTML syntax, RScript will crash. It's reproducible. Check the page with https://validator.w3.org.
In your case:
"Error: Start tag body seen but an element of the same type was
already open."
From line 107, column 1; to line 107, column 25
crashed it. So your document had two opening <body> tags. A quick-and-dirty solution for me was to check first whether read_html gets valid HTML content:
url = "http://www.blah.de"
page = read_html(url, encoding = "UTF-8")
# check HTML-validity first to prevent fatal crash
if (!grepl("<html.*<body.*</body>.*</html>", toString(page), ignore.case=T)) {
print("Skip this Site")
}
# proceed with html_nodes(..) etc
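One way to fold that check into a scraping loop is a small helper (a sketch; is_parseable_html is a made-up name and the URL is the same placeholder as above):
library(rvest)

# TRUE only when the serialized document contains matching
# <html>/<body> opening and closing tags, as in the check above
is_parseable_html <- function(page) {
  grepl("<html.*<body.*</body>.*</html>", toString(page), ignore.case = TRUE)
}

page <- read_html("http://www.blah.de", encoding = "UTF-8")
if (is_parseable_html(page)) {
  article_text <- html_text(html_nodes(page, "p"))
} else {
  message("Skip this site")
}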
I searched the site thoroughly but didn't find an answer. The script looks in a file directory and builds a vector of file names (which I then turn into a data frame). A for loop is then used to pass each file name to a custom function that takes the file name as its argument. The parser works fine with a single file name (i.e. when I call parser(filename) directly) but fails when it runs through the for loop. So I suspect the problem is in how I'm passing the argument.
files <- list.files(path = "~/pathname", pattern = "*.xml", full.names = F, recursive = FALSE)
files.df <- as.data.frame(files)

for (i in 1:nrow(files.df)) {
  parser(as.character(files.df[i, ]))
}
When I print(as.character(files.df[i,])), the output is as expected. Again, running the custom parser with one file name as the argument (outside the for loop) works perfectly. The working directory (set with setwd) is the same directory the files are read from.
Anything you may have come across before?
Update: the script worked fine yesterday with one file but now I get the same error. It seems to happen at this line:
xmlfile=xmlParse(filename, useInternalNodes = TRUE)
This problem is similar to that seen here.
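For reference, a minimal variant that skips the intermediate data frame and hands full paths to parser() (the custom function from the question) directly:
# full.names = TRUE returns complete paths, so the loop does not depend on the
# working directory matching the folder that holds the XML files
files <- list.files(path = "~/pathname", pattern = "\\.xml$",
                    full.names = TRUE, recursive = FALSE)

for (f in files) {
  parser(f)
}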
I have a large number of large CSVs which I am loading and parsing serially through a function. Many of these CSVs present no problem, but there are several which are causing problems when I try to load them with read.csv().
I have uploaded one of these files to a public Dropbox folder here (note that the file is around 10.4MB).
When I try to read.csv() that file, I get the following warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on ...
And I cannot isolate the problem, despite scouring StackOverflow and Rhelp for solutions. Maddeningly, when I run
Import <- read.csv("http://dl.dropbox.com/u/83576/Candidate%20Mentions.csv")
using the Dropbox URL instead of my local path, it loads, but when I then save that very data frame and try to reload it thus:
write.csv(Import, "Test_File.csv", row.names = F)
TestImport <- read.csv("Test_File.csv")
I get the "incomplete final line" warning again.
So, I am wondering why the Dropbox-loaded version works, while the local version does not, and how I can make my local versions work -- since I have somewhere around 400 of these files (and more every day), I can't use a solution that can't be automated in some way.
In a related problem, perhaps deserving of its own question, it appears that some "special characters" break the read.csv() process, and prevent the loading of the entire file. For example, one CSV which has 14,760 rows only loads 3,264 rows. The 3,264th row includes this eloquent Tweet:
"RT #akiron3: ácÎå23BkªÐÞ'q(#BarackObama )nĤÿükTPP ÍþnĤüÈ’áY‹ªÐÞĤÿüŽ
\&’ŸõWˆFSnĤ©’FhÎåšBkêÕ„kĤüÈLáUŒ~YÒhttp://t.co/ABNnWfTN
“jg)(WˆF"
Again, given the serialized loading of several hundred files, how can I (a) identify what is causing this break in the read.csv() process, and (b) fix the problem with code, rather than by hand?
Thanks so much for your help.
1)
suppressWarnings(TestImport <- read.csv("Test_File.csv") )
2) Unmatched quotes are the most common cause of apparent premature closure. You could try adding all of these:
quote="", na,strings="", comment.char=""