I'm trying to scrape a set of news articles using rvest and boilerpipeR. The code works fine most of the time, but it crashes for some specific values. I searched online high and low and could not find anyone reporting anything similar.
require(rvest)
require(stringr)
require(boilerpipeR)
require(RCurl) # getURLContent() comes from RCurl

# this is a problematic URL; its duplicates also generate fatal errors
url = "http://viagem.estadao.com.br/noticias/geral,museu-da-mafia-ganha-exposicao-permanente-da-serie-the-breaking-bad,10000018395"
content_html = getURLContent(url) # HTML source code as a character string
article_text = ArticleExtractor(content_html) # returns NA

# the next line induces the fatal error
encoded_exit = read_html(content_html, encoding = "UTF-8")
paragraph = html_nodes(encoded_exit, "p")
article_text = html_text(paragraph)
article_text = iconv(article_text, from = "UTF-8", to = "latin1")
This is not the only news piece for which ArticleExtractor() returns NA, and the code was built to handle that as a viable result. The whole snippet is inside a tryCatch(), so regular errors should not be able to stop execution.
The main issue is that the entire R session just crashes and has to be reloaded, which prevents me from grabbing data and debugging it.
What could be causing this issue?
And how can I stop it from crashing the entire R session?
I had the same problem.
RScript crashes without any error message (session aborted), no matter whether I use 32-bit or 64-bit R.
The solution for me was to look at the URL I was scraping.
If the page behind that URL has severe mistakes in its HTML syntax, the R session will crash. It's reproducible. Check the page with https://validator.w3.org.
In your case:
"Error: Start tag body seen but an element of the same type was
already open."
From line 107, column 1; to line 107, column 25
crashed it. So your document had two opening <body> tags. A quick-and-dirty solution for me was to check first whether read_html gets valid HTML content:
url = "http://www.blah.de"
page = read_html(url, encoding = "UTF-8")
# check HTML-validity first to prevent fatal crash
if (!grepl("<html.*<body.*</body>.*</html>", toString(page), ignore.case=T)) {
print("Skip this Site")
}
# proceed with html_nodes(..) etc
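As a variation on the same idea (a sketch of mine, not part of the original answer), one could fetch the raw HTML as plain text first, for example with readLines(), and look for a duplicate opening <body> tag before ever calling read_html(), so the crash-prone parse is never reached for broken pages:

library(rvest)

url <- "http://www.blah.de"
raw_html <- paste(readLines(url, warn = FALSE), collapse = "\n")

# crude validity check: more than one opening <body> tag is a bad sign
if (length(gregexpr("<body", raw_html, ignore.case = TRUE)[[1]]) > 1) {
  print("Skip this Site")
} else {
  encoded_exit <- read_html(raw_html, encoding = "UTF-8")
  # proceed with html_nodes(encoded_exit, "p") etc.
}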
I'm trying to write a scraper that goes through a list of pages (all from the same site) and either 1. downloads the HTML/CSS from each page, or 2. gets the links that exist within a list item with a particular class. (For now, my code reflects the former.) I'm doing this in R; Python returned a 403 error on the very first GET request to the site, so BeautifulSoup and Selenium were ruled out. In R, my code works for a time (a rather short one), and then I receive a 403 error, specifically:
"Error in open.connection(x, "rb") : HTTP error 403."
I considered putting a Sys.sleep() timer on each item in the loop, but I need to run this nearly 1000 times, so I found that solution impractical. I'm a little stumped as to what to do, particularly since the code does work, but only for a short time before it's halted. I was looking into proxies/headers, but my knowledge of either of these is unfortunately rather limited (although, of course, I'd be willing to learn if anyone has a suggestion involving either of these). Any help would be sincerely appreciated. Here's the code for reference:
for (i in 1:length(data1$Search)) {
  url = data1$Search[i]
  name = data1$Name[i]
  download.file(url, destfile = paste(name, ".html", sep = ""), quiet = TRUE)
}
where data1 is a two column dataframe with the columns "Search" and "Name". Once again, any suggestions are much welcome. Thank you.
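One possible direction, sketched here under my own assumptions (the User-Agent string below is a placeholder, and download.file() only accepts a headers argument in R >= 3.6.0): wrap each download in tryCatch() so a single 403 does not halt the loop, and send a browser-like User-Agent in case the default R client is being blocked.

ua <- c("User-Agent" = "Mozilla/5.0") # placeholder UA string

for (i in 1:length(data1$Search)) {
  url  <- data1$Search[i]
  name <- data1$Name[i]
  tryCatch(
    download.file(url, destfile = paste(name, ".html", sep = ""),
                  quiet = TRUE, headers = ua),
    error = function(e) message("Skipping (HTTP error): ", url)
  )
}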
Dear Stack Overflow users,
I am using R to scrape profiles of a few psychotherapists from Psychology Today; this is done as an exercise to learn more about web scraping.
I am new to R and I have to go through this intense training that will help me with a future project. This means I might not know precisely what I am doing at the moment (e.g. I might not interpret the script or R's error messages correctly), but I have to get it done. Therefore, I beg your pardon for possible misunderstandings or inaccuracies.
In short, the situation is the following.
I have created a function through which I scrape information from 2 nodes of psychotherapists' profiles; the function is shown in this Stack Overflow post.
Then I create a loop where that function is used on a few psychotherapists' profiles; the loop is in the above post as well, but I report it below because it is the part of the script that generates problems (in addition to what I solved in the above-mentioned post).
j <- 1
MHP_codes <- c(150140:150180) #therapist identifier
df_list <- vector(mode = "list", length(MHP_codes))
for(code1 in MHP_codes) {
URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
#Reading the HTML code from the website
URL <- read_html(URL)
df_list[[j]] <- tryCatch(getProfile(URL),
error = function(e) NA)
j <- j + 1
}
When the loop is done, I bind the information from the different profiles into one data frame and save it:
final_df <- rbind.fill(df_list)
save(final_df,file="final_df.Rda")
The function (getProfile) works well on individual profiles.
It works also on a small range of profiles ( c(150100:150150)).
Please note that I do not know which psychotherapist IDs are actually assigned, so many URLs within the range do not exist.
However, generally speaking, tryCatch should handle this. When a URL is non-existent (and thus the ID is not associated with any psychotherapist), each of the 2 nodes (and thus each of the 2 corresponding variables in my data frame) is empty (i.e. the data frame shows NAs in the corresponding cells).
However, in some IDs ranges, two problems might happen.
First, I get an error message such as the following one:
Error in open.connection(x, "rb") : HTTP error 404.
This happens despite the fact that I am using tryCatch, and despite the fact that it generally appears to work (at least until the error message appears).
Moreover, after the loop is stopped and R runs the line:
final_df <- rbind.fill(df_list)
A second message appears, this time a warning:
Warning message:
In df[[var]] :
closing unused connection 3 (https://www.psychologytoday.com/us/therapists/illinois/150152)
It seems like there is a specific problem with that one empty URL.
In fact, when I change the ID range, the loop works well despite non-existent URLs: when the URL exists, the information is scraped from the website; when it does not, the 2 variables associated with that URL (and thus with that psychotherapist ID) get an NA.
Is it possible, perhaps, to tell R to skip the URL if it is empty? Without recording anything?
This solution would be excellent, since it would shrink the data frame to the existing URLs, but I do not know how to do it and I do not know whether it is a solution to my problem.
Is anyone able to help me sort out this issue?
Yes, you need to wrap a tryCatch around the read_html call. This is where R tries to connect to the website, so it will throw an error (as opposed to returning an empty object) there if it fails to connect. You can catch that error and then use next to tell R to skip to the next iteration of the loop.
library(rvest)
##Valid URL, works fine
URL <- "https://news.bbc.co.uk"
read_html(URL)
##Invalid URL, error raised
URL <- "https://news.bbc.co.uk/not_exist"
read_html(URL)
##Leads to error
Error in open.connection(x, "rb") : HTTP error 404.
##Invalid URL, catch and skip to next iteration of the loop
URL <- "https://news.bbc.co.uk/not_exist"
tryCatch({
  URL <- read_html(URL)},
  error = function(e) {
    print("URL Not Found, skipping")
    next
  })
I would like to thank @Jul for the answer.
Here I post my updated loop:
j <- 1
MHP_codes <- c(150000:150200) #therapist identifier
df_list <- vector(mode = "list", length(MHP_codes))
for(code1 in MHP_codes) {
delayedAssign("do.next", {next})
URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
#Reading the HTML code from the website
URL <- tryCatch(read_html(URL),
error = function(e) force(do.next))
df_list[[j]] <- getProfile(URL)
j <- j + 1
}
final_df <- rbind.fill(df_list)
As you can see, something had to be changed: although the answer from @Jul came close to solving the problem, the loop still stopped, so I had to slightly change the original suggestion.
In particular, I introduced the following line inside the loop but outside of the tryCatch call:
delayedAssign("do.next", {next})
and, in the tryCatch error handler, the following call:
force(do.next)
This is based on this other Stack Overflow post.
I searched the site thoroughly but didn't find an answer. The script looks into a file directory and builds a vector of file names (which I then turn into a data frame). A for loop is then used to pass each file name to a custom function. The parser works fine with a single file name (i.e. when I call parser(filename) directly) but fails when run through the for loop, so I suspect it's how I'm passing the argument.
files <- list.files(path="~/pathname", pattern="*.xml", full.names = F, recursive=FALSE)
files.df <- as.data.frame(files)
for (i in 1:nrow(files.df)) {
  parser(as.character(files.df[i,]))
}
When I print(as.character(files.df[i,])), the output is as expected. Again, running the custom parser with one file name as argument (outside the for loop) works perfectly. The working directory (setwd) is the same as the directory the files are read from.
Anything you may have come across before?
Update: the script worked fine yesterday with one file but now I get the same error. It seems to happen at this line:
xmlfile=xmlParse(filename, useInternalNodes = TRUE)
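A minimal sketch under my own assumptions (parser() is the custom function from the question): loop over the character vector of file names directly, request full paths so xmlParse() can find each file regardless of the working directory, and wrap the call in tryCatch() to report which file breaks the parser instead of stopping the loop:

library(XML)

files <- list.files(path = "~/pathname", pattern = "\\.xml$",
                    full.names = TRUE, recursive = FALSE)

for (f in files) {
  tryCatch(parser(f),
           error = function(e) message("Parsing failed for: ", f))
}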
I have a question about downloading files. I know how to download files, using the download.file function. I need to download multiple files from a particular site, each file corresponding to a different date. I have a series of dates, using which I can prepare the URL to download the file. I know for a fact that for some particular dates, the files are missing on the website. Subsequently my code stops at that point. I then have to manually reset the date index (increment it by 1) and re-run the code. Since I have to download more than 1500 files, I was wondering if I can somehow capture the 'absence of the file' and instead of the code stopping, it continues with the next date in the array.
Below is the dput of a part of the date array:
dput(head(fnames,10))
c("20060102.trd", "20060103.trd", "20060104.trd", "20060105.trd",
"20060106.trd", "20060109.trd", "20060110.trd", "20060112.trd",
"20060113.trd", "20060116.trd")
This file has 1723 dates. Below is the code that I am using:
for (i in 1:length(fnames)) {
  file <- paste(substr(fnames[i], 7, 8), substr(fnames[i], 5, 6), substr(fnames[i], 1, 4), sep = "")
  URL <- paste("http://xxxxx_", file, ".zip", sep = "")
  download.file(URL, paste(file, "zip", sep = "."))
  unzip(paste(file, "zip", sep = "."))
}
The program works fine, till it encounters a particular date for which the file is missing, and it stops. Is there a way to capture this, and print the missing file name (the variable 'file'), and move on to the next date in the array?
Please help.
I apologize that I have not shared the exact URL. If that makes it difficult to reproduce the issue, please let me know.
Update: trying to incorporate @Paul's suggestion.
I worked on a smaller dataset.
dput(testnames) is
c("20120214.trd", "20120215.trd", "20120216.trd", "20120217.trd",
"20120221.trd")
I know that the file corresponding to the date '20120216' is missing from the website. I altered my code to incorporate the tryCatch function. Here it is:
tryCatch({
  for (i in 1:length(testnames)) {
    file <- paste(substr(testnames[i], 7, 8), substr(testnames[i], 5, 6), substr(testnames[i], 1, 4), sep = "")
    URL <- paste("http://xxxx_", file, ".zip", sep = "")
    download.file(URL, paste(file, "zip", sep = "."))
    unzip(paste(file, "zip", sep = "."))
  }
},
error = function(e) {
  cat(file, '\n')
  i = i + 1
},
warning = function(w) {
  message('cannot unzip')
  i = i + 1
})
It runs fine for the first two dates and, as expected, throws an error for the 3rd one. I am facing 2 issues:
1. When I 'exclude' the warning block, the error block prints the missing file name (file), as coded. But when I 'include' the warning block, it only issues the warning and somehow does not execute the error block. Why is that?
2. In either case, the code stops after reading "20120216.trd" and does not move on to the next file, which is what I want it to do. Is incrementing the variable i not sufficient for that purpose?
Please advise.
You can do this using tryCatch. This function will try the operation you feed it and provide you with a way of dealing with errors. In your case, an error could simply lead to skipping the file and ignoring the error. For example:
skip_with_message = simpleError('Did not work out')
tryCatch(print(bla), error = function(e) skip_with_message)
# <simpleError: Did not work out>
Notice that the error here is that the bla object does not exist.
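Applied to the download loop above, a sketch (using the same fnames vector and URL stub from the question) would put tryCatch() inside the loop, around each individual download, so a missing file is reported and the loop simply moves on to the next date:

for (i in 1:length(fnames)) {
  file <- paste(substr(fnames[i], 7, 8), substr(fnames[i], 5, 6), substr(fnames[i], 1, 4), sep = "")
  URL <- paste("http://xxxxx_", file, ".zip", sep = "")
  tryCatch({
    download.file(URL, paste(file, "zip", sep = "."))
    unzip(paste(file, "zip", sep = "."))
  },
  error   = function(e) cat("Missing file, skipped:", file, "\n"),
  warning = function(w) cat("Problem with:", file, "\n"))
}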
This problem is similar to that seen here.
I have a large number of large CSVs which I am loading and parsing serially through a function. Many of these CSVs present no problem, but there are several which are causing problems when I try to load them with read.csv().
I have uploaded one of these files to a public Dropbox folder here (note that the file is around 10.4MB).
When I try to read.csv() that file, I get the following warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on ...
I cannot isolate the problem, despite scouring Stack Overflow and R-help for solutions. Maddeningly, when I run
Import <- read.csv("http://dl.dropbox.com/u/83576/Candidate%20Mentions.csv")
using the Dropbox URL instead of my local path, it loads, but when I then save that very data frame and try to reload it thus:
write.csv(Import, "Test_File.csv", row.names = F)
TestImport <- read.csv("Test_File.csv")
I get the "incomplete final line" warning again.
So, I am wondering why the Dropbox-loaded version works, while the local version does not, and how I can make my local versions work -- since I have somewhere around 400 of these files (and more every day), I can't use a solution that can't be automated in some way.
In a related problem, perhaps deserving of its own question, it appears that some "special characters" break the read.csv() process, and prevent the loading of the entire file. For example, one CSV which has 14,760 rows only loads 3,264 rows. The 3,264th row includes this eloquent Tweet:
"RT #akiron3: ácÎå23BkªÐÞ'q(#BarackObama )nĤÿükTPP ÍþnĤüÈ’áY‹ªÐÞĤÿüŽ
\&’ŸõWˆFSnĤ©’FhÎåšBkêÕ„kĤüÈLáUŒ~YÒhttp://t.co/ABNnWfTN
“jg)(WˆF"
Again, given the serialized loading of several hundred files, how can I (a) identify what is causing this break in the read.csv() process, and (b) fix the problem with code, rather than by hand?
Thanks so much for your help.
1)
suppressWarnings(TestImport <- read.csv("Test_File.csv") )
2) Unmatched quotes are the most common cause of apparent premature closure. You could try adding all of these:
quote="", na,strings="", comment.char=""