R: Errors when web scraping across multiple tables with the same URL

I'm fairly new to web scraping and am having trouble troubleshooting my code. At the moment I'm getting a different error every time and don't really know where to continue. I'm currently looking into RSelenium, but I would greatly appreciate some advice and feedback on the code below.
I based my initial code on the following question: R: How to web scrape a table across multiple pages with the same URL
library(xml2)
library(RCurl)
library(dplyr)
library(rvest)
i=1
table = list()
for (i in 1:15) {
data=("https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/","?page=",i))
page <- read_html(data)
table1 <- page %>%
html_nodes(xpath = "(//table)[2]") %>%
html_table(header=T)
i=i+1
table1[[1]][[7]]=as.integer(gsub(",", "",table1[[1]][[7]]))
table=bind_rows(table, table1)
print(i)}
table$`ÅR`=as.Date(table$`ÅR`,format ="%Y")
Below are the errors I am receiving at the moment. I know it's a lot, but I assume some of them are a result of previous errors. Any help would be greatly appreciated!
i=1
table = list()
for (i in 1:15) {
data=("https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/","?page=",i))
Error: unexpected ',' in:
"for (i in 1:15) {
data=("https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/","
page <- read_html(data)
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "function"
table1 <- page %>%
html_nodes(xpath = "(//table)[2]") %>%
html_table(header=T)
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "function"
i=i+1
table1[[1]][[7]]=as.integer(gsub(",", "",table1[[1]][[7]]))
Error in is.factor(x) : object 'table1' not found
table=bind_rows(table, table1)
Error in list2(...) : object 'table1' not found
print(i)}
Error: unexpected '}' in " print(i)}"
table$`ÅR`=as.Date(table$`ÅR`,format ="%Y")

The following code produces a data frame containing all the data you are seeking. Rather than using RSelenium, the code below fetches the data directly from the same API that the site uses to populate the table, so you do not need to combine multiple pages:
library(tidyverse)
library(rvest)
library(jsonlite)
#### Get the total number of items ####
url <- "https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/10/"
data <- jsonlite::fromJSON(url, flatten = TRUE)
totalItems <- data$TotalNumberOfItems

#### Get all of the items in one request ####
allData <- paste0('https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/', totalItems, '/') %>%
  jsonlite::fromJSON(., flatten = TRUE) %>%
  .[1] %>%
  as.data.frame() %>%
  rename_with(~str_replace(., "ListItems.", ""), everything())
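A possible follow-up, if you still want the type conversions from your original loop: the sketch below assumes the returned data frame has a year column named ÅR and a comma-separated price column. The name Pris is only a placeholder; check names(allData) for the real column names.
# Sketch only: `Pris` is a placeholder column name, ÅR is taken from the question.
allData <- allData %>%
  mutate(
    Pris = as.integer(gsub(",", "", Pris)),  # strip thousands separators
    ÅR   = as.integer(ÅR)                    # as.Date() needs a full date, so keep the year as an integer
  )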

Related

How to skip error Error in open.connection(x, "rb") : HTTP error 504

I am new to the world of R and have not been able to skip the URLs for which the website shows: “504 error: That content doesn't seem to exist…”
There is a list of people on the website, and I need to get the table as well as the information in the nested links for each of those people.
But the webpage gives a 504 error for one person (the 84th), so I would like to know how I can skip that page and mark the webpage for that specific person as non-existent in my data frame.
Thanks for your help.
Here is my code:
library(rvest)
library(dplyr)
library(stringr)
library(jsonlite)
library(readr)

url = "https://www.barrons.com/advisor/report/top-financial-advisors/100?id=/100/2022&type=ranking_tables"
doc = fromJSON(txt = url)
result = doc$data$data
print(result)

link = str_split_fixed(doc$data$data$Advisor, "\'", n = Inf)
advisor_links = link[, 4]

for (i in 1:length(advisor_links)) {
  name_link = advisor_links[i]
  advisor_page = read_html(name_link)
  position = advisor_page %>%
    html_nodes(".BarronsTheme--lg--18rTokdG p:nth-child(1)") %>%
    html_text() %>%
    paste(collapse = ",")
  print(position)
}
If you know the index of the person you want to remove, you can simply omit it from advisor_links before you run your for loop.
advisor_links <- advisor_links[-84]
If multiple websites error out, I would suggest using the tryCatch function (see How to write trycatch in R) and putting it inside your for loop like so:
for (i in 1:length(advisor_links)) {
  tryCatch({
    name_link = advisor_links[i]
    advisor_page = read_html(name_link)
    position = advisor_page %>%
      html_nodes(".BarronsTheme--lg--18rTokdG p:nth-child(1)") %>%
      html_text() %>%
      paste(collapse = ",")
    if (position == "") print("Non-existent") else print(position)
  }, error = function(e) NULL)
}
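Since the question also asks for the failing person to be marked as non-existent in a data frame rather than just printed, the same tryCatch pattern can return a value per link and be collected afterwards. This is only a sketch (not run against the live site); it reuses the selector from above and labels failures "Non-existent":
# Sketch: one result per advisor link, with failures marked "Non-existent".
positions <- vapply(advisor_links, function(name_link) {
  tryCatch({
    read_html(name_link) %>%
      html_nodes(".BarronsTheme--lg--18rTokdG p:nth-child(1)") %>%
      html_text() %>%
      paste(collapse = ",")
  }, error = function(e) "Non-existent")
}, FUN.VALUE = character(1))
advisors <- data.frame(link = advisor_links, position = positions)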

Problems extracting data using JSON in R (getting a lexical error)

Related to the question asked here: R - Using SelectorGadget to grab a dataset
library(rvest)
library(jsonlite)
library(magrittr)
library(stringr)
library(purrr)
library(dplyr)
get_state_index <- function(states, state) {
  return(match(T, map(states, ~ {
    .x$name == state
  })))
}

s <- read_html("https://www.opentable.com/state-of-industry") %>% html_text()
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2])
fullbook <- all_data$covidDataCenter$fullbook

hawaii_dataset <- tibble(
  date = fullbook$headers %>% unlist() %>% as.Date(),
  yoy = fullbook$states[get_state_index(fullbook$states, "Hawaii")][[1]]$yoy %>% unlist()
)
I am trying to grab the Hawaii dataset from the State tab. The code was working before but now it is throwing an error with this part of the code:
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2])
I am getting the error:
Error: lexical error: invalid char in json text. NA (right here) ------^
Any proposed solutions? The website seems to have stayed the same over the year, so what kind of change is causing the code to break?
EDIT: The solution proposed by @QHarr:
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2])
This was working for a while, but then it seems the website again changed its underlying HTML.
Change the regex pattern as shown below so that it correctly captures the desired string within the response text, i.e. the JavaScript object to use for all_data:
all_data <- jsonlite::parse_json(stringr::str_match(s, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2])
Note: in R the escape is doubled, e.g. \\s rather than the \s you would write in the bare regex.
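To see why the [\s\S] version is more robust, here is a small self-contained illustration with a made-up string (the string is hypothetical; the two patterns are the ones above). By default, . in stringr/ICU regexes does not match newlines, so the original pattern fails as soon as the embedded JavaScript object spans more than one line:
library(stringr)
# Hypothetical stand-in for the page source, with a newline inside the object.
s <- "__INITIAL_STATE__ = {\"a\": 1,\n\"b\": 2};w.foo"
str_match(s, "__INITIAL_STATE__ = (.*?\\});w\\.")[, 2]   # NA: "." stops at the newline
str_match(s, "__INITIAL_STATE__ = ([\\s\\S]+\\});")[, 2] # captures the whole object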

Scraping links in df columns with rvest

I have a data frame where one of the columns contains the links to webpages I want to scrape with rvest. I would like to download some links, store them in another column, and then download some text from them. I tried to do this using lapply, but at the second step I get Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "function". Maybe the problem is that the first set of links is saved as a list. Do you know how I can solve it?
This is my MWE (in my full dataset I have around 5,000 links; should I use Sys.sleep, and how?):
library(rvest)
df <- structure(list(numeroAtto = c("2855", "2854", "327", "240", "82"
), testo = c("http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.2855.18PDL0127540",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.327.18PDL0003550",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.327.18PDL0003550",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.240.18PDL0007740",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.82.18PDL0001750"
)), row.names = c(NA, 5L), class = "data.frame")
df$links_text <- lapply(df$testo, function(x) {
  page <- read_html(x)
  links <- html_nodes(page, '.value:nth-child(8) .fixed') %>%
    html_text(trim = T)
})

df$text <- lapply(df$links_text, function(x) {
  page1 <- read_html(x)
  links1 <- html_nodes(page, 'p') %>%
    html_text(trim = T)
})
You want links1 <- html_nodes(page, 'p') to refer to page1, not page.
(Otherwise, since there is no object called page in the function environment, R tries to apply html_nodes to the utils function page.)
In terms of Sys.sleep, it is fairly optional. Check the page HTML and see whether anything in the code or user agreement prohibits scraping. If so, scraping more gently might improve your chances of not getting blocked!
You can just include Sys.sleep(n) in the function where you create df$text, as in the sketch below. n is up to you; I've had luck with 1-3 seconds, but it does become pretty slow!
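A minimal sketch of the corrected second step with a pause between requests (the 2-second delay is an arbitrary choice, and page1 replaces the stray page from the question):
df$text <- lapply(df$links_text, function(x) {
  Sys.sleep(2)                                  # be kind to the server
  page1 <- read_html(x)
  html_nodes(page1, 'p') %>% html_text(trim = TRUE)
})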
You may do this in a single sapply command and use tryCatch to handle errors.
library(rvest)
df$text <- sapply(df$testo, function(x) {
  tryCatch({
    x %>%
      read_html() %>%
      html_nodes('.value:nth-child(8) .fixed') %>%
      html_text(trim = T) %>%
      read_html() %>%
      html_nodes('p') %>%
      html_text(trim = T) %>%
      toString()
  }, error = function(e) NA)
})

Scraping through multiple websites

I am trying to get multiple tables and later transform them (after some manipulation) into one data frame in R. See the code below:
countries <- c("au","at","de","se","gb","us")
for (i in countries) {
sides<-glue("https://www.beeradvocate.com/beer/top-rated/",i,.sep = "")
html[i] <- read_html(sides)
cont[i] <- html[i] %>%
html_nodes("table") %>% html_table()
}
If I do so, I get the following error message:
number of items to replace is not a multiple of replacement length
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "list"
Can someone help me?
Thanks a lot!!!
require(tidyverse)
require(rvest)

path_base <- "https://www.beeradvocate.com/beer/top-rated/"
countries <- c("au","at","de","se","gb","us")
path <- paste0(path_base, countries)

html_files <- path %>%
  map(read_html)

html_files %>%
  map(html_node, css = "table") %>%
  map(html_table, header = TRUE, fill = TRUE)
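Since the goal in the question is a single data frame, the resulting list of tables can then be named by country and bound together. This is a sketch of one way to do it (bind_rows fills any mismatched columns with NA; top_rated is just a name chosen here):
top_rated <- html_files %>%
  map(html_node, css = "table") %>%
  map(html_table, header = TRUE, fill = TRUE) %>%
  set_names(countries) %>%           # label each table with its country code
  bind_rows(.id = "country")         # one data frame with a country column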

Rvest scraping errors

Here's the code I'm running
library(rvest)
rootUri <- "https://github.com/rails/rails/pull/"
PR <- as.list(c(100, 200, 300))
list <- paste0(rootUri, PR)
messages <- lapply(list, function(l) {
  html(l)
})
Up until this point it seems to work fine, but when I try to extract the text:
html_text(messages)
I get:
Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) :
Unknown input of class: list
Trying to extract a specific element:
html_text(messages[1])
Can't do that either...
Error in xml_apply(x, XML::xmlValue, ..., .type = character(1)) :
Unknown input of class: list
So I try a different way:
html_text(messages[[1]])
This seems to at least get at the data, but is still not successful:
Error in UseMethod("xmlValue") :
no applicable method for 'xmlValue' applied to an object of class "c('HTMLInternalDocument', 'HTMLInternalDocument', 'XMLInternalDocument', 'XMLAbstractDocument')"
How can I extract the text material from each of the elements of my list?
There are two problems with your code. Look here for examples on how to use the package.
1. You cannot just use every function with everything.
html() is for downloading content
html_node() is for selecting node(s) from the downloaded content of a page
html_text() is for extracting text from a previously selected node
Therefore, to download one of your pages and extract the text of the html-node, use this:
library(rvest)
old-school style:
url <- "https://github.com/rails/rails/pull/100"
url_content <- html(url)
url_mainnode <- html_node(url_content, "*")
url_mainnode_text <- html_text(url_mainnode)
url_mainnode_text
... or this ...
hard to read old-school style:
url_mainnode_text <- html_text(html_node(html("https://github.com/rails/rails/pull/100"), "*"))
url_mainnode_text
... or this ...
magrittr piping style:
url_mainnode_text <-
html("https://github.com/rails/rails/pull/100") %>%
html_node("*") %>%
html_text()
url_mainnode_text
2. When using lists you have to apply functions to the list with e.g. lapply()
If you want to batch-process several URLs, you can try something like this:
url_list <- c("https://github.com/rails/rails/pull/100",
"https://github.com/rails/rails/pull/200",
"https://github.com/rails/rails/pull/300")
get_html_text <- function(url, css_or_xpath = "*") {
  html_text(
    html_node(
      html(url), css_or_xpath
    )
  )
}
lapply(url_list, get_html_text, css_or_xpath="a[class=message]")
You need to use html_nodes() and identify which CSS selectors relate to the data you're interested in. For example, if we want to extract the usernames of the people discussing pull request 200:
rootUri <- "https://github.com/rails/rails/pull/200"
page <- html(rootUri)
page %>% html_nodes('#discussion_bucket strong a') %>% html_text()
[1] "jaw6" "jaw6" "josevalim"
