I am going to start off by saying that my knowledge of XML is pretty minimal.
I promise you that until 2 or 3 days ago the following code worked perfectly:
library("rvest")
url<-"https://en.wikipedia.org/wiki/Opinion_polling_for_the_next_United_Kingdom_general_election"
H<-read_html(url)
table<-html_table(H, fill=TRUE)
Z<-table[1]; Z1<-Z[[1]]
Which then allowed me to get on and do what I wanted, extracting the first table from that web page and putting it in data frame Z1. However, this has suddenly stopped working and I keep getting the error message:
Error in if (length(p) > 1 & maxp * n != sum(unlist(nrows)) & maxp * n != :
missing value where TRUE/FALSE needed
When I look at H it seems no longer to be a list and now looks like this:
{xml_document}
<html class="client-nojs" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject ...
It is clearly failing at html_table.
I really don't know where to start with this at all.
I believe you missed a step: you need to parse out the table nodes before calling html_table().
library("rvest")
url<-"https://en.wikipedia.org/wiki/Opinion_polling_for_the_next_United_Kingdom_general_election"
H<-read_html(url)
tables<-html_nodes(H, "table")
Z1<-html_table(tables[1], fill = TRUE)[[1]]
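For a quick check, here is a minimal sketch of the same fix using html_node() to grab just the first <table>, so no [[1]] indexing is needed (which table comes first is, of course, an assumption about the page layout):
library("rvest")
url <- "https://en.wikipedia.org/wiki/Opinion_polling_for_the_next_United_Kingdom_general_election"
H <- read_html(url)
first_table <- html_node(H, "table")        # a single node rather than a node set
Z1 <- html_table(first_table, fill = TRUE)  # returns a data frame directly
str(Z1)                                     # inspect what came back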
Suppose I have the below text:
x <- "<p>I would like to run tests for a package with <code>testthat</code> and compute code coverage with <code>covr</code>. Furthermore, the results from <code>testthat</code> should be saved in the JUnit XML format and the results from <code>covr</code> should be saved in the Cobertura format.</p>\n\n<p>The following code does the trick (when <code>getwd()</code> is the root of the package):</p>\n\n<pre><code>options(\"testthat.output_file\" = \"test-results.xml\")\ndevtools::test(reporter = testthat::JunitReporter$new())\n\ncov <- covr::package_coverage()\ncovr::to_cobertura(cov, \"coverage.xml\")\n</code></pre>\n\n<p>However, the tests are executed <em>twice</em>. Once with <code>devtools::test</code> and once with <code>covr::package_coverage</code>. </p>\n\n<p>My understanding is that <code>covr::package_coverage</code> executes the tests, but it does not produce <code>test-results.xml</code>.</p>\n\n<p>As the title suggests, I would like get both <code>test-results.xml</code> and <code>coverage.xml</code> with a single execution of the test suite.</p>\n"
PROBLEM:
I need to remove all <code></code> tags and their content, regardless of whether they stand on their own or sit inside another tag.
I HAVE TRIED:
I have tried the following:
content <- xml2::read_html(x) %>%
  rvest::html_nodes(css = ":not(code)")
print(content)
But the result I get is the following, and the tags are still there:
{xml_nodeset (8)}
[1] <body>\n<p>I would like to run tests for a package with <code>testthat</code> and compute code coverage with <code>cov ...
[2] <p>I would like to run tests for a package with <code>testthat</code> and compute code coverage with <code>covr</code> ...
[3] <p>The following code does the trick (when <code>getwd()</code> is the root of the package):</p>
[4] <pre><code>options("testthat.output_file" = "test-results.xml")\ndevtools::test(reporter = testthat::JunitReporter$new ...
[5] <p>However, the tests are executed <em>twice</em>. Once with <code>devtools::test</code> and once with <code>covr::pac ...
[6] <em>twice</em>
[7] <p>My understanding is that <code>covr::package_coverage</code> executes the tests, but it does not produce <code>test ...
[8] <p>As the title suggests, I would like get both <code>test-results.xml</code> and <code>coverage.xml</code> with a sin ...
The solution was the following:
content <- xml2::read_html(x)
toRemove <- content %>% rvest::html_nodes(css = "code")
xml2::xml_remove(toRemove)
After that, content contained no code tags or their content, and nothing was manipulated as a string.
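For completeness, a minimal end-to-end sketch of that approach (assuming x is the HTML string from above); html_text() here is just one way to confirm the <code> content is really gone:
library(xml2)
library(rvest)
content <- read_html(x)                    # x is the HTML string from above
toRemove <- html_nodes(content, "code")    # every <code> node, nested or not
xml_remove(toRemove)                       # drop the nodes and their content in place
html_text(content)                         # remaining text, with no code fragments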
When I run the code below with a for loop or with lapply, I get the following error:
"Error in get_entrypoint (debug_port):
Cannot connect R to Chrome. Please retry. "
library(rvest)
library(xml2) #pull html data
library(selectr) #for xpath element
url_stackoverflow_rmarkdown <-
'https://stackoverflow.com/questions/tagged/r-markdown?tab=votes&pagesize=50'
web_page <- read_html(url_stackoverflow_rmarkdown)
questions_per_page <- html_text(html_nodes(web_page, ".page-numbers.current"))[1]
link_questions <- html_attr(html_nodes(web_page, ".question-hyperlink")[1:questions_per_page],
                            "href")
setwd("~/WebScraping_chrome_print_to_pdf")
for (i in 1:length(link_questions)) {
  question_to_pdf <- paste0("https://stackoverflow.com",
                            link_questions[i])
  pagedown::chrome_print(question_to_pdf)
}
Is it possible to build a for loop (or use lapply) that resumes from where it broke, i.e. from the last i value, without stopping the whole run?
Many thanks
I edited @Rui Barradas' idea of using tryCatch().
You can try to do something like below.
IsValues will hold either the value returned for the link or the index of the bad i.
IsValues <- list()
for (i in 1:length(link_questions)) {
  question_to_pdf <- paste0("https://stackoverflow.com",
                            link_questions[i])
  IsValues[[i]] <- tryCatch(
    {
      message(paste("Converting", i))
      pagedown::chrome_print(question_to_pdf)
    },
    error = function(cond) {
      message(paste("Cannot convert", i))
      # Choose a return value in case of error
      return(i)
    })
}
Then you can rbind your values and extract the bad i's:
do.call(rbind, IsValues)[!grepl("\\.pdf$", do.call(rbind, IsValues))]
[1] "3" "5" "19" "31"
You can read more about tryCatch() in this answer.
Based on your example, it looks like you have two errors to contend with. The first error is the one you mention in your question. It is also the most frequent error:
Error in get_entrypoint (debug_port): Cannot connect R to Chrome. Please retry.
The second error arises when there are links in the HTML that return 404:
Failed to generate output. Reason: Failed to open https://lh3.googleusercontent.com/-bwcos_zylKg/AAAAAAAAAAI/AAAAAAAAAAA/AAnnY7o18NuEdWnDEck_qPpn-lu21VTdfw/mo/photo.jpg?sz=32 (HTTP status code: 404)
The key phrase in the first error is "Please retry". As far as I can tell, chrome_print sometimes has issues connecting to Chrome. It seems to be fairly random, i.e. failed connections in one run will be fine in the next, and vice versa. The easiest way to get around this issue is to just keep trying until it connects.
I can't come up with any fix for the second error. However, it doesn't seem to come up very often, so it might make sense to just record it and skip to the next URL.
Using the following code I'm able to print 48 of 50 pages. The only two I can't get to work have the 404 issue I describe above. Note that I use purrr::safely to catch errors. Base R's tryCatch will also work fine, but I find safely to be a little more convenient. That said, in the end it's really just a matter of preference.
Also note that I've dealt with the connection error by utilizing repeat within the for loop. R will keep trying to connect to Chrome and print until it is either successful, or some other error pops up. I didn't need it, but you might want to include a counter to set an upper threshold for the number of connection attempts:
quest_urls <- paste0("https://stackoverflow.com", link_questions)
errors <- NULL
safe_print <- purrr::safely(pagedown::chrome_print)

for (qurl in quest_urls) {
  repeat {
    output <- safe_print(qurl)
    if (is.null(output$error)) break
    else if (grepl("retry", output$error$message)) next
    else {errors <- c(errors, `names<-`(output$error$message, qurl)); break}
  }
}
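If you do want that cap, a minimal sketch might look like the following (max_tries is a made-up name, and five attempts is an arbitrary choice):
max_tries <- 5

for (qurl in quest_urls) {
  tries <- 0
  repeat {
    tries <- tries + 1
    output <- safe_print(qurl)
    if (is.null(output$error)) break
    if (grepl("retry", output$error$message) && tries < max_tries) next
    errors <- c(errors, `names<-`(output$error$message, qurl))
    break
  }
}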
When simply using lapply, the read_html page results are retained.
library(xml2)
lapply(c("https://www.analyticsvidhya.com/blog/2018/06/datahack-radio-1-machine-learning-competitions-with-kaggle-ceo-anthony-goldbloom/","https://www.analyticsvidhya.com/blog/2018/09/datahack-radio-lyft-dr-alok-gupta/"), function(x){read_html(x)})
#> [[1]]
#> {xml_document}
#> <html>
#> [1] <head lang="en-US" prefix="og: http://ogp.me/ns#">\n<meta http-equiv ...
#> [2] <body class="post-template-default single single-post postid-45087 s ...
#>
#> [[2]]
#> {xml_document}
#> <html>
#> [1] <head lang="en-US" prefix="og: http://ogp.me/ns#">\n<meta http-equiv ...
#> [2] <body class="post-template-default single single-post postid-46725 s ...
While using parallel::mclapply, they are not:
library(xml2)
library(parallel)
mclapply(c("https://www.analyticsvidhya.com/blog/2018/06/datahack-radio-1-machine-learning-competitions-with-kaggle-ceo-anthony-goldbloom/","https://www.analyticsvidhya.com/blog/2018/09/datahack-radio-lyft-dr-alok-gupta/"), function(x){read_html(x)}, mc.cores = 2)
#> [[1]]
#> {xml_document}
#>
#> [[2]]
#> {xml_document}
I can't figure out why this is happening; even with foreach I'm not able to get the same results as with plain lapply. Help!
Time to sew
(I mean, you used the word thread so I'm not passing up the opportunity for a pun or three).
Deep in the manual page for ?parallel::mclapply you'll eventually see that it works by:
forking processes
serializing results
eventually gathering up these serialized results and combining them into one object
You can read ?serialize to see the method used.
Why can't we serialize xml_document/html_document objects?
First, let's make one:
library(xml2)
(doc <- read_html("<p>hi there!</p>"))
## {xml_document}
## <html>
## [1] <body><p>hi there!</p></body>
and look at the structure:
str(doc)
## List of 2
## $ node:<externalptr>
## $ doc :<externalptr>
## - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
doc$node
## <pointer: 0x7ff45ab17ce0>
Hrm. Those are <externalptr> objects. What does ?"externalptr-class" (eventually) say abt them?
…
"externalptr" # raw external pointers for use in C code
Since it's not a built-in object and the data is hidden away and only accessible via the package interface, R can't serialize it on its own and needs help. (That hex string — 0x7ff45ab17ce0 — is the memory pointer to where this opaque data is hidden).
"You can't be serious…"
Totally am.
In the event you're from Missouri (the "Show Me" state), we can see what happens without the complexity of parallel ops and raw connection object serialization machinations by just trying to save the document above to an RDS file and read it back:
tf <- tempfile(fileext = ".rds")
saveRDS(doc, tf)
doc2 <- readRDS(tf)
str(doc2)
## List of 2
## $ node:<externalptr>
## $ doc :<externalptr>
## - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
Now, you may be all like "AHA! See, it works!" Aaaaand…you'd be wrong:
doc2$node
## <pointer: 0x0>
The 0x0 means it's not pointing to anything. You've lost all that data. It's gone. Forever. (But, it had a good run so we should not be too sad abt it).
This has been discussed by the xml2 devs and — rather than make life easier for us — they punted and made ?xml_serialize.
Wait…there's an xml_serialize but it's kinda not all that useful?
Yep. And, it gets even better worse.
Hopefully your curiosity was sufficiently piqued that you went ahead and found out what this quite seriously named xml_serialize() function does. If not, this is R, so to find out just type its name without the () to get:
function (object, connection, ...)
{
if (is.character(connection)) {
connection <- file(connection, "w", raw = TRUE)
on.exit(close(connection))
}
serialize(structure(as.character(object, ...), class = "xml_serialized_document"),
connection)
}
Apart from wiring up some connection bits, the complex sorcery behind this xml_serialize function is, well, just as.character(). (Kind of a let-down, actually.)
Since parallel ops perform (idiomatically) the equivalent of saveRDS() => readRDS() when you return an xml_document, html_document (or their _node[s] siblings) in a parallel apply you eventually get back a whole pile of nothing.
What can a content thief innocent scraper do to overcome this devastating limitation?
You are left with (at minimum) four choices:
🤓 expand the complexity of your function in the parallel apply to process the XML/HTML document into a data frame, vector or list of objects that can all be serialized automagically by R so they can be combined for you
be cool 😎 and have one parallel apply that saves off the HTML into files (the HTTP ops are likely the slow bit anyway) and then a non-parallel operation that read them sequentially and processes them — which it looks like you were going to do anyway. Note that you're kind of being a leech and rly bad netizen if you don't do the HTML caching to file anyway since you're showing you really don't care about the bandwidth and CPU costs of the content you're purloining scraping.
don't be cool by doing ^^ 😔 and, instead, use as.character(read_html(…)) to return raw, serializable, character HTML directly from your parallel apply and then re-xml2 them back in the rest of your program (see the sketch after this list)
😱 fork the xml2 📦, layer in a proper serialization hack and don't bother PR'ing it since you'll likely spend a lot of time trying to convince them it's worth it and still end up failing since this "externalptr serializing" is tricksy business, fraught with peril and you likely missed some edge cases (i.e. Hadley/Jim/etc know what they're doing and if they punted, it's prbly something not worth doing).
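For instance, a minimal sketch of option 3 (plain character HTML comes back from the workers, which serializes fine, and is re-parsed afterwards) might look like this:
library(xml2)
library(parallel)

urls <- c("https://www.analyticsvidhya.com/blog/2018/06/datahack-radio-1-machine-learning-competitions-with-kaggle-ceo-anthony-goldbloom/",
          "https://www.analyticsvidhya.com/blog/2018/09/datahack-radio-lyft-dr-alok-gupta/")

# character vectors survive the serialize/unserialize round trip
raw_html <- mclapply(urls, function(u) as.character(read_html(u)), mc.cores = 2)

# re-parse in the main process, where the external pointers can live
docs <- lapply(raw_html, read_html)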
In reality, rather than use xml2::read_html() to grab the content, I'd use httr::GET() + httr::content(…, as="text") instead (if you're being cool and caching the pages vs callously wasting other folks' resources) since read_html() uses libxml2 under the covers and transforms the document (even if sometimes just a little) and it's better to have untransformed raw, cached source data vs something mangled by software that thinks it's smarter than we are.
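As a rough sketch of that idea (fetch_cached is a hypothetical helper; the caching scheme is entirely up to you):
library(httr)

fetch_cached <- function(u, cache_file) {
  if (!file.exists(cache_file)) {
    res <- GET(u)
    stop_for_status(res)
    writeLines(content(res, as = "text", encoding = "UTF-8"), cache_file)
  }
  # untransformed source text; parse with xml2::read_html() only when needed
  paste(readLines(cache_file, warn = FALSE), collapse = "\n")
}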
FIN
There really isn't any more I can do to clarify this than the above, verbose-mode blathering. Hopefully this expansion also helps others grok what's going on as well.
I am trying to run an R loop on an individual-based model. This includes two lists referring to grid cells, which I originally ran into difficulties with because they returned the error: Error: (list) object cannot be coerced to type 'double'. I think I have resolved this error with as.numeric(unlist(x)).
Example from code:
List 1:
dredg<-list(c(943,944,945,946,947,948,949...1744,1745))
dredging<-as.numeric(unlist(dredg))
I refer to 'dredging' in my code, not 'dredg'.
List 2:
nodredg<-list(c(612,613,614,615,616,617,618,619,620,621,622,623,624,625,626,627,628,629,630,631))
dcells<-as.numeric(unlist(nodredg))
I refer to 'dcells' in my code, not 'nodredg'.
However, now when I use these two number arrays (if that's what they are) my code returns the error
Error in dcells[[dredging[whichD]]] : subscript out of bounds
I believe this error is referring to the following lines of code:
if(julday>=dstart[whichD] & julday<=dend[whichD] & dredging[whichD]!=0){
  whichcells<-which(gridd$inds[gridd$julday==julday] %in% dcells[[ dredging[whichD] ]]) #identify cells where dredging is occurring
}
where:
whichD<-1
julday<-1+day
dstart = c(1,25,75,100)
dend = c(20,60,80,117)
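For reference, a minimal sketch that reproduces the error with shortened versions of the two lists above (the values are truncated here):
dredg <- list(c(943, 944, 945))          # truncated
nodredg <- list(c(612, 613, 614, 615))   # truncated
dredging <- as.numeric(unlist(dredg))
dcells <- as.numeric(unlist(nodredg))
whichD <- 1
dcells[[ dredging[whichD] ]]   # dredging[1] is 943, but dcells is much shorter
# Error in dcells[[dredging[whichD]]] : subscript out of bounds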
Here is the full block of code:
for (i in 1:time.steps){
  qday <- qday + 1
  for (p in 1:pop){
    if (is.na(dolphin.q[p,1,i]!=dolphin.q[p,2,i])) {
      dolphin.q[p,6,i]<-which.max(c(dolphin.q[p,1,i],dolphin.q[p,2,i]))
    }else{
      dolphin.q[p,6,i]<-rmulti(c(0.5,0.5))
    }
    usP<-usage[p,]
    if(julday>=dstart[whichD] & julday<=dend[whichD] & dredging[whichD]!=0){
      whichcells<-which(gridd$inds[gridd$julday==julday] %in% dcells[[ dredging[whichD] ]])
      usP[whichcells]<-0
      usP<-usP/sum(usP) #rescale the home range surface to 1
    }
I was wondering if anyone could show me what is going wrong? I apologize if this is a very simple mistake; I am a novice learner who has been scouring the internet, manuals, and Stack for days with no luck!
Thanks in advance!
This is not the first time I've encountered a problem while using htmlParse in the XML library, but in the past I've just given up and used a regex to parse what I needed instead. I'd rather do it by parsing the XML/XHTML since, as we all know, regexes aren't parsers.
That said, I find the error messages from the parse commands to be non-helpful at best, and I have no idea how to proceed. For instance:
> htmlParse(getForm("http://www.takecarehealth.com/LocationSearchResults.aspx", location_query="Deer Park",location_distance=50))
Error in htmlParse(getForm("http://www.takecarehealth.com/LocationSearchResults.aspx", :
File
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head id="ctl00_Head1">
<title></title>
<script language="JavaScript" type="text/javascript">
var s_pageName = document.title;
var s_channel = "Take Care";
var s_campaign = "";
var s_eVar1 = ""
var s_eVar2 = ""
var s_eVar22 = ""
var s_eVar23 = ""
</script>
<meta name="keywords" content="take care clinic, walgreens clinic, walgreens take care clinic, take care health, urgent care clinic, walk in clinic" />
<meta name="description" content="Information about simple, quality healthcare for the whole family from Take Care Clinics at select Walgreens, including Take Care Clinic hours, providers, offers, insurance and quality of care." />
<link rel="shortcut icon" hre
I'm glad it sees something in there, but where do I drill down past "Error: File"?
Note this is, as far as I can tell, well-formed XHTML. When I visit the link manually I can run xpaths on it and Firebug does not complain.
How do I debug errors from htmlParse like this?
Downloading first, then passing to the XML package, seems to work:
test <- getForm("http://www.takecarehealth.com/LocationSearchResults.aspx",
                location_query = "Deer Park", location_distance = 50)
htmlParse(test, asText = TRUE)
or directly
htmlParse(getForm("http://www.takecarehealth.com/LocationSearchResults.aspx",
                  location_query = "Deer Park", location_distance = 50),
          asText = TRUE)
also seems fine
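As a quick follow-up sketch (the XPath here is just an illustration), the parsed document can then be queried as usual:
library(RCurl)
library(XML)

test <- getForm("http://www.takecarehealth.com/LocationSearchResults.aspx",
                location_query = "Deer Park", location_distance = 50)
doc <- htmlParse(test, asText = TRUE)
xpathSApply(doc, "//title", xmlValue)   # e.g. pull the page title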