I have a question about downloading files. I know how to download a file with the download.file function. I need to download multiple files from a particular site, each file corresponding to a different date, and I have a series of dates from which I can build each URL. I know for a fact that for some dates the files are missing on the website, and when that happens my code stops at that point. I then have to manually reset the date index (increment it by 1) and re-run the code. Since I have to download more than 1500 files, I was wondering if I can somehow capture the 'absence of the file' so that, instead of stopping, the code continues with the next date in the array.
Below is the dput of a part of the date array:
dput(head(fnames,10))
c("20060102.trd", "20060103.trd", "20060104.trd", "20060105.trd",
"20060106.trd", "20060109.trd", "20060110.trd", "20060112.trd",
"20060113.trd", "20060116.trd")
The full vector has 1723 dates. Below is the code that I am using:
for (i in 1:length(fnames)) {
  # rearrange YYYYMMDD from the file name into DDMMYYYY for the URL
  file <- paste(substr(fnames[i], 7, 8), substr(fnames[i], 5, 6), substr(fnames[i], 1, 4), sep = "")
  URL <- paste("http://xxxxx_", file, ".zip", sep = "")
  download.file(URL, paste(file, "zip", sep = "."))
  unzip(paste(file, "zip", sep = "."))
}
The program works fine until it encounters a particular date for which the file is missing, and then it stops. Is there a way to capture this, print the missing file name (the variable 'file'), and move on to the next date in the array?
Please help.
I apologize that I have not shared the exact URL. If that makes it difficult to reproduce the issue, please let me know.
Edit: Trying to incorporate @Paul's suggestion.
I worked on a smaller dataset.
dput(testnames) is
c("20120214.trd", "20120215.trd", "20120216.trd", "20120217.trd",
"20120221.trd")
I know that the file corresponding to the date '20120216' is missing from the website. I altered my code to incorporate the tryCatch function; here it is:
tryCatch({
  for (i in 1:length(testnames)) {
    file <- paste(substr(testnames[i], 7, 8), substr(testnames[i], 5, 6), substr(testnames[i], 1, 4), sep = "")
    URL <- paste("http://xxxx_", file, ".zip", sep = "")
    download.file(URL, paste(file, "zip", sep = "."))
    unzip(paste(file, "zip", sep = "."))
  }
},
error = function(e) {
  cat(file, '\n')
  i = i + 1
},
warning = function(w) {
  message('cannot unzip')
  i = i + 1
})
It runs fine for the first two dates and, as expected, throws an error for the third one. I am facing 2 issues:
When I 'exclude' the warning block, it gives me the missing file name (file) as coded in the error block. But when I 'include' the warning block, it only issues the warning and somehow doesn't execute the error block. Why is that?
In either case, the code stops after reading "20120216.trd" and doesn't proceed to the next file, which is what I want it to do. Is incrementing the variable i not sufficient for that purpose?
Please advise.
You can do this using tryCatch. This function will try the operation you feed it and provide you with a way to deal with errors. In your case, for example, an error could simply lead to skipping the file and ignoring the error:
skip_with_message = simpleError('Did not work out')
tryCatch(print(bla), error = function(e) skip_with_message)
# <simpleError: Did not work out>
Notice that the error here is that the bla object does not exist.
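Applied to the loop in the question, one common pattern is to put the tryCatch inside the loop body rather than around the whole loop, so that a failed download only skips that one date. This is just a sketch reusing the question's placeholder URL:
for (i in seq_along(fnames)) {
  # rearrange YYYYMMDD into DDMMYYYY, as in the question
  file <- paste0(substr(fnames[i], 7, 8),
                 substr(fnames[i], 5, 6),
                 substr(fnames[i], 1, 4))
  URL  <- paste0("http://xxxxx_", file, ".zip")   # placeholder URL
  dest <- paste(file, "zip", sep = ".")

  tryCatch({
    download.file(URL, dest)
    unzip(dest)
  },
  error = function(e) {
    # print the missing file name; the loop then moves on to the next date
    cat("skipping", file, "\n")
  },
  warning = function(w) {
    message("problem with ", file, ": ", conditionMessage(w))
  })
}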
I'm creating a simple little time-saver wrapper function that pre-fills some standard file locations, etc., for importing an Excel file using readxl::read_xlsx. It works exactly as expected with the defaults; however, when I try to use it at the console with a different file location I get the following error.
Error in read_space_program(path = "inst/extdata/space_program.xlsx") :
unused argument (path = "inst/extdata/space_program.xlsx")
I've tried adding ... to the argument list to extend the arguments, as suggested on StackOverflow for similar error messages, but it does not fix the problem. This is the code I am running:
read_space_program <-
function(file_location = "inst/extdata/space_program.xlsx",
sheet_name = "Program",
skip_rows = 5, ...) {
readxl::read_xlsx(
path = file_location,
sheet = sheet_name,
col_names = TRUE,
skip = skip_rows
) # first five rows skipped to allow for project information
}
Without uploading the entire .xlsx file, suffice it to say that I use this particular file all the time and it is not the source of the problem. It loads fine with this exact code when I call read_space_program(); however, when I feed the exact same file location to it at the console with read_space_program(file_location = "inst/extdata/space_program.xlsx"), I get the error above. The cause is probably something basic, but I cannot figure it out. Any help is appreciated.
This was caused by an artifact of development in my environment. Cleaning the environment allowed the code to run.
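For anyone hitting the same message, here is a minimal illustration of how a stale copy of the function left in the global environment can produce it (the old signature shown here is hypothetical):
# an earlier, hypothetical version of the wrapper without a file_location argument
read_space_program <- function(sheet_name = "Program") {
  readxl::read_xlsx("inst/extdata/space_program.xlsx", sheet = sheet_name)
}

# while this old definition is the one in the environment, calling it with the
# new argument fails with "unused argument (file_location = ...)":
# read_space_program(file_location = "inst/extdata/space_program.xlsx")

# clearing the environment (or restarting R) and re-sourcing the current
# definition removes the stale copy
rm(list = ls())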
I am brand new to R and I am trying to run some existing code that should clean up an input .csv then save the cleaned data to a different location as a .RData file. This code has run fine for the previous owner.
The code seems to be pulling the .csv and cleaning it just fine. It also looks like the save is running (there are no errors) but there is no output in the specified location. I thought maybe R was having a difficult time finding the location, but it's pulling the input data okay and the destination is just a sub folder.
After a full day of extensive Googling, I can't find anything related to a save just not working.
Example code below:
save(data, file = "C:\\Users\\my_name\\Documents\\Project\\Data.RData", sep="")
Hard to believe you don't see any errors - unless something has switched errors off:
> data = 1:10
> save(data, file="output.RData", sep="")
Error in FUN(X[[i]], ...) : invalid first argument
It's a misleading error: the problem is the third argument, which doesn't do anything. Remove it and it works:
> save(data, file="output.RData")
>
sep is an argument used when writing CSV files, where it separates columns. save writes binary data, which doesn't have rows and columns.
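A short sketch of the distinction (the file names are just examples):
data <- 1:10

# save() writes a binary .RData image and takes no sep argument;
# the object comes back by name with load()
save(data, file = "Data.RData")
load("Data.RData")

# sep belongs to text writers such as write.table(), where it separates columns
write.table(data.frame(x = data), file = "Data.txt", sep = ",", row.names = FALSE)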
I'm trying to scrape a set of news articles using rvest and boilerpipeR. The code works fine most of the time; however, it crashes for some specific values. I searched online high and low and could not find anyone reporting anything similar.
require(rvest)
require(stringr)
require(boilerpipeR)
require(RCurl)   # getURLContent() comes from RCurl

# this is a problematic URL; its duplicates also generate fatal errors
url = "http://viagem.estadao.com.br/noticias/geral,museu-da-mafia-ganha-exposicao-permanente-da-serie-the-breaking-bad,10000018395"
content_html = getURLContent(url)              # HTML source code as a character string
article_text = ArticleExtractor(content_html)  # returns 'NA'
# the next line induces the fatal error
encoded_exit = read_html(content_html, encoding = "UTF-8")
paragraph = html_nodes(encoded_exit, "p")
article_text = html_text(paragraph)
article_text = iconv(article_text, from = "UTF-8", to = "latin1")
This is not the only news piece for which ArticleExtractor() returns 'NA', and the code was built to handle that as a viable result. The whole snippet is inside a tryCatch(), so regular errors should not be able to stop execution.
The main issue is that the entire R session just crashes and has to be reloaded, which prevents me from grabbing data and debugging it.
What could be causing this issue?
And how can I stop it from crashing the entire R session?
I had the same problem.
RScript crashes without any error message (session aborted), whether I use 32-bit or 64-bit.
The solution for me was to look at the URL I was scraping.
If the page at the URL has severe mistakes in its HTML syntax, RScript will crash. It's reproducible. Check the page with https://validator.w3.org.
In your case:
"Error: Start tag body seen but an element of the same type was
already open."
From line 107, column 1; to line 107, column 25
crashed it. So your document had two opening <body> tags. A quick-and-dirty solution for me was to check first whether read_html gets valid HTML content:
url = "http://www.blah.de"
page = read_html(url, encoding = "UTF-8")
# check HTML-validity first to prevent fatal crash
if (!grepl("<html.*<body.*</body>.*</html>", toString(page), ignore.case=T)) {
print("Skip this Site")
}
# proceed with html_nodes(..) etc
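One way to fold that check into a scraping loop is a small helper (a sketch; is_parseable_html is a made-up name and the URL is the same placeholder as above):
library(rvest)

# TRUE only when the serialized document contains matching
# <html>/<body> opening and closing tags, as in the check above
is_parseable_html <- function(page) {
  grepl("<html.*<body.*</body>.*</html>", toString(page), ignore.case = TRUE)
}

page <- read_html("http://www.blah.de", encoding = "UTF-8")
if (is_parseable_html(page)) {
  article_text <- html_text(html_nodes(page, "p"))
} else {
  message("Skip this site")
}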
I searched the site thoroughly but didn't find an answer. The script looks in a file directory and builds a vector of file names (which I then turn into a data frame). A for loop is then used to pass each file name to a custom function that takes the file name as its argument. The parser works fine with a single file name (i.e. when I call parser(filename) directly) but fails when it runs through the for loop. So I suspect the problem is in how I'm passing the argument.
files <- list.files(path = "~/pathname", pattern = "*.xml", full.names = F, recursive = FALSE)
files.df <- as.data.frame(files)

for (i in 1:nrow(files.df)) {
  parser(as.character(files.df[i, ]))
}
When I print(as.character(files.df[i,])), the output is as expected. Again, running the custom parser with one file name as the argument (outside the for loop) works perfectly. The working directory (set with setwd) is the same directory the files are read from.
Anything you may have come across before?
Update: the script worked fine yesterday with one file but now I get the same error. It seems to happen at this line:
xmlfile=xmlParse(filename, useInternalNodes = TRUE)
This problem is similar to that seen here.
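For reference, a minimal variant that skips the intermediate data frame and hands full paths to parser() (the custom function from the question) directly:
# full.names = TRUE returns complete paths, so the loop does not depend on the
# working directory matching the folder that holds the XML files
files <- list.files(path = "~/pathname", pattern = "\\.xml$",
                    full.names = TRUE, recursive = FALSE)

for (f in files) {
  parser(f)
}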
I have a large number of large CSVs which I am loading and parsing serially through a function. Many of these CSVs present no problem, but there are several which are causing problems when I try to load them with read.csv().
I have uploaded one of these files to a public Dropbox folder here (note that the file is around 10.4MB).
When I try to read.csv() that file, I get the following warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on ...
And I cannot isolate the problem, despite scouring StackOverflow and Rhelp for solutions. Maddeningly, when I run
Import <- read.csv("http://dl.dropbox.com/u/83576/Candidate%20Mentions.csv")
using the Dropbox URL instead of my local path, it loads, but when I then save that very data frame and try to reload it thus:
write.csv(Import, "Test_File.csv", row.names = F)
TestImport <- read.csv("Test_File.csv")
I get the "incomplete final line" warning again.
So, I am wondering why the Dropbox-loaded version works, while the local version does not, and how I can make my local versions work -- since I have somewhere around 400 of these files (and more every day), I can't use a solution that can't be automated in some way.
In a related problem, perhaps deserving of its own question, it appears that some "special characters" break the read.csv() process, and prevent the loading of the entire file. For example, one CSV which has 14,760 rows only loads 3,264 rows. The 3,264th row includes this eloquent Tweet:
"RT #akiron3: ácÎå23BkªÐÞ'q(#BarackObama )nĤÿükTPP ÍþnĤüÈ’áY‹ªÐÞĤÿüŽ
\&’ŸõWˆFSnĤ©’FhÎåšBkêÕ„kĤüÈLáUŒ~YÒhttp://t.co/ABNnWfTN
“jg)(WˆF"
Again, given the serialized loading of several hundred files, how can I (a) identify what is causing this break in the read.csv() process, and (b) fix the problem with code, rather than by hand?
Thanks so much for your help.
1)
suppressWarnings(TestImport <- read.csv("Test_File.csv") )
2) Unmatched quotes are the most common cause of apparent premature closure. You could try adding all of these:
quote="", na,strings="", comment.char=""