Why is my loop not working when I web scrape in R?

library(tidyverse)
library(rvest)
library(htmltools)
library(xml2)
library(dplyr)
#import using read_html
results <- read_html("https://www.artemis.bm/deal-directory/")
#Results are in a list - can look at list by running 'results' in the console
#The issuers' information extracted
issuers <- results %>% html_nodes("#table-deal a") %>% html_text()
cedent <- results %>% html_nodes("td:nth-child(2)") %>% html_text()
risks <- results %>% html_nodes("td:nth-child(3)") %>% html_text()
size <- results %>% html_nodes("td:nth-child(4)") %>% html_text()
date <- results %>% html_nodes("td:nth-child(5)") %>% html_text()
#This scrapes all of the links for each issuer page
url <- results %>% html_nodes("#table-deal a") %>% html_attr("href")
#getting data from within the links
get_placement = function(url_link) {
  url_link = "https://www.artemis.bm/deal-directory/cape-lookout-re-ltd-series-2022-1/"
  issuer_page = read_html(url_link)
  placement = issuer_page %>% html_nodes("#info-box li:nth-child(3)") %>%
    html_text()
}
This code works, and the last bit from get_placement gets the information I am after (the placement section): whichever link I put in, it gives me the placement for that deal. However, when I try to loop it, it does not work.
#here is my issue
get_placement = function(url_link) {
  issuer_page = read_html(url_link)
  placement = issuer_page %>% html_nodes("#info-box li:nth-child(3)") %>%
    html_text()
  return(placement)
}
This only gives me one value, but I need the placement information from all 833 links:
issuer_placement = sapply(url, FUN = get_placement)
When I try to use sapply I get this message
Browse[1]> issuer_placement = sapply(url, FUN = get_placement)
Error during wrapup: no applicable method for 'read_xml' applied to an object of class "name"
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
and
function (con, open = "r", blocking = TRUE, ...)
.Internal(open(con, open, blocking))

This worked for me without any problem
issue_placement <- lapply(url, function(u) {
  tryCatch(return(get_placement(u)),
           error = function(e) return("Not retrieved - error"),
           warning = function(w) return("Not retrieved - warning"))
})
When I pushed issue_placement into a data.table (see below), I found 330 unique results and no errors/warnings.
data.table::data.table(placement = unlist(issue_placement))[,.N, placement]
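For completeness, a minimal sketch of how the retrieved placements could be joined back to the other scraped columns; it assumes the vectors built above (issuers, cedent, risks, size, date) are all aligned with the 833 links, which the original post does not verify.
library(tidyverse)
# unlist() flattens the lapply() result into a character vector
deals <- tibble(
  issuer    = issuers,
  cedent    = cedent,
  risks     = risks,
  size      = size,
  date      = date,
  placement = unlist(issue_placement)
)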

Related

R: webscraping returns no applicable method for 'xml_find_all' applied to an object of class "character"?

I'm using RSelenium and purrr functions to generate a df with all the products in this page and their prices:
https://www.lacuracao.pe/curacao/tv-y-audio/televisores
I'm getting this error, why?
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "character"
Code:
library(RSelenium)
library(rvest)
library(dplyr)
library(stringr)
library(purrr)
#start RSelenium
rD <- rsDriver(port = 4560L, browser = "chrome", version = "3.141.59", chromever = "93.0.4577.63",
               geckover = "latest", iedrver = NULL, phantomver = "2.1.1",
               verbose = TRUE, check = TRUE)
remDr <- rD[["client"]]
Sys.sleep(10)
tvs_url <- "https://www.lacuracao.pe/curacao/tv-y-audio/televisores"
remDr$navigate(tvs_url)
Sys.sleep(10)
#scroll down 20 times, waiting for the page to load at each time
for(i in 1:20){
  remDr$executeScript(paste("scroll(0,", i*10000, ");"))
  Sys.sleep(5)
}
h<-remDr$getPageSource()
df <- map_dfr(h %>%
                map(~ .x %>%
                      html_nodes("div.product")), ~
                data.frame(
                  periodo = lubridate::year(Sys.Date()),
                  fecha = Sys.Date(),
                  ecommerce = "lacuracao",
                  producto = .x %>% html_node(".product_name") %>% html_text(),
                  precio.antes = .x %>% html_node('.old-price') %>% html_text(),
                  precio.actual = .x %>% html_node('#offerPriceValue') %>% html_text()
                ))
Update 1:
I've changed h<-remDr$getPageSource() to h<-remDr$getPageSource()[[1]] and now class(h) returns character.
Update 2:
Tried:
h<-remDr$getPageSource()[[1]]
hh <- h %>% read_html() %>% html_elements("div.product")
class(hh) #[1] "xml_nodeset"
But getting this when trying to form the df:
Error in data.frame(periodo = lubridate::year(Sys.Date()), fecha = Sys.Date(), :
arguments imply differing number of rows: 1, 0
Use remDr$getPageSource()[[1]] to get the actual document.
You then need to pipe that to your DOM parser, i.e. remDr$getPageSource()[[1]] %>% read_html(), and continue on as before, i.e. ... %>% html_elements(...).
RSelenium has its own methods for selecting elements via the WebDriver instance, e.g. remDr$findElement("css", "body"). In your case, you are choosing to transfer the HTML across into something you can call rvest's html_nodes() on, i.e. either a document, a node set or a single node. As what is transferred is HTML, read_html() is needed to generate a document for parsing.
The error inside the data.frame() call is because you need to handle missing child nodes, i.e. products where certain prices are missing.
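A rough sketch of one way to do that, assuming the page structure used above; it relies on html_node() returning a missing node (and hence NA from html_text()) when a child element is absent, so every column in the per-product data.frame has length one.
library(rvest)
library(purrr)

h <- remDr$getPageSource()[[1]]          # page source as a character string
products <- h %>% read_html() %>% html_elements("div.product")

df <- map_dfr(products, ~ data.frame(
  periodo       = lubridate::year(Sys.Date()),
  fecha         = Sys.Date(),
  ecommerce     = "lacuracao",
  # html_node() yields a missing node when the child is absent,
  # so html_text() returns NA instead of character(0)
  producto      = .x %>% html_node(".product_name")    %>% html_text(),
  precio.antes  = .x %>% html_node(".old-price")       %>% html_text(),
  precio.actual = .x %>% html_node("#offerPriceValue") %>% html_text()
))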

Scraping links in df columns with rvest

I have a dataframe where one of the columns contains links to the webpages I want to scrape with rvest. I would like to download some links, store them in another column, and then download some text from them. I tried to do it using lapply, but at the second step I get Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "function". Maybe the problem is that the first links are saved as a list. Do you know how I can solve it?
This is my MWE (in my full dataset I have around 5000 links, should I use Sys.sleep and how?)
library(rvest)
df <- structure(list(numeroAtto = c("2855", "2854", "327", "240", "82"
), testo = c("http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.2855.18PDL0127540",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.327.18PDL0003550",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.327.18PDL0003550",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.240.18PDL0007740",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.82.18PDL0001750"
)), row.names = c(NA, 5L), class = "data.frame")
df$links_text <- lapply(df$testo, function(x) {
  page <- read_html(x)
  links <- html_nodes(page, '.value:nth-child(8) .fixed') %>%
    html_text(trim = T)
})

df$text <- lapply(df$links_text, function(x) {
  page1 <- read_html(x)
  links1 <- html_nodes(page, 'p') %>%
    html_text(trim = T)
})
You want links1 <- html_nodes(page, 'p') to refer to page1, not page.
(Otherwise, since there is no object page in the function environment, R tries to apply html_nodes to the utils function page.)
In terms of Sys.sleep, it is fairly optional. Check the page's HTML and user agreement to see whether anything prohibits scraping. If so, scraping more kindly to the server might improve your chances of not getting blocked!
You can just include Sys.sleep(n) in the function where you create df$text. n is up to you; I've had luck with 1-3 seconds, but it does make the run pretty slow.
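Putting both points together, a rough sketch of the corrected second step (the one-second pause is an arbitrary choice, not something from the original answer):
df$text <- lapply(df$links_text, function(x) {
  Sys.sleep(1)                        # pause between requests to be gentle on the server
  page1 <- read_html(x)
  html_nodes(page1, 'p') %>%          # note: page1, not page
    html_text(trim = TRUE)
})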
You may do this in a single sapply command and use tryCatch to handle errors.
library(rvest)
df$text <- sapply(df$testo, function(x) {
  tryCatch({
    x %>%
      read_html() %>%
      html_nodes('.value:nth-child(8) .fixed') %>%
      html_text(trim = T) %>%
      read_html() %>%
      html_nodes('p') %>%
      html_text(trim = T) %>%
      toString()
  }, error = function(e) NA)
})

How to skip a page scrape when table is missing in R

I'm building a scrape that pulls the name of a player and the years he played for thousands of different players. I have built an otherwise successful function to do this, but unfortunately in some instances the table with the other half of the data I need (years played) does not exist. For those instances, I'd like a way to tell the scrape to skip the page. Here is the code:
(note: the object "url_final" is the list of active webpage URLs of which there are many)
library(rvest)
library(curl)
library(tidyverse)
library(httr)
df <- map_dfr(.x = url_final,
              .f = function(x){Sys.sleep(.3); cat(1);
                fyr <- read_html(curl(x, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
                  html_table() %>%
                  .[[1]]
                fyr <- fyr %>%
                  select(1) %>%
                  mutate(name = str_extract(string = x, pattern = "(?<=cbb/players/).*?(?=-\\d\\.html)"))
              })
Here is an example of an active page in which you can recreate the scrape by replacing "url_final" as the .x call in map_dfr with:
https://www.sports-reference.com/cbb/players/karl-aaker-1.html
Here is an example of one of the instances in which there is no table and thus returns an error breaking the loop of the scrape.
https://www.sports-reference.com/cbb/players/karl-aaker-1.html
How about adding tryCatch, which will ignore any errors?
library(tidyverse)
library(rvest)
df <- map_dfr(.x = url_final,
              .f = function(x){Sys.sleep(.3); cat(1);
                tryCatch({
                  fyr <- read_html(curl::curl(x,
                                              handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
                    html_table() %>% .[[1]]
                  fyr <- fyr %>%
                    select(1) %>%
                    mutate(name = str_extract(string = x,
                                              pattern = "(?<=cbb/players/).*?(?=-\\d\\.html)"))
                }, error = function(e) message('Skipping url', x))
              })
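An alternative to tryCatch, since purrr is already in use here, is purrr::possibly(), which wraps a function so that it returns a default value instead of throwing an error. A minimal sketch along the same lines (scrape_one is just a name I've given the per-page helper):
library(tidyverse)
library(rvest)

scrape_one <- function(x) {
  Sys.sleep(.3)
  fyr <- read_html(curl::curl(x, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
    html_table() %>%
    .[[1]]
  fyr %>%
    select(1) %>%
    mutate(name = str_extract(string = x, pattern = "(?<=cbb/players/).*?(?=-\\d\\.html)"))
}

# possibly() returns NULL for pages without a table; map_dfr() drops NULLs
df <- map_dfr(url_final, possibly(scrape_one, otherwise = NULL))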

rvest, following a link present on each node to get more data?

So I'm trying to scrape data from a site that contains club data for clubs at my school. I've got a good script going that scrapes the surface-level data from the site; however, I can get more data by clicking the "more information" link at each club, which leads to the club's profile page. I would like to scrape the data from that page (specifically the Facebook link).
Below you'll see my current attempt at this.
url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
page <- html_session(url)
get_more_info <- function(position) {
  page <- follow_link(page, css = ".grpl-moreinfo > a:nth-child(" + position + ")")
  html_node(sub_page, xpath = '//*[#id="dnf_class_values_student_group__facebook__widget"]') %>% html_text()
  page <- page %>% back()
}
get_table <- function(page, count) {
  #find group names
  name_text <- html_nodes(page, ".grpl-name a") %>% html_text()
  df <- data.frame(name_text, stringsAsFactors = FALSE)
  #find text description
  desc_text <- html_nodes(page, ".grpl-purpose") %>% html_text()
  df$desc_text <- trimws(desc_text)
  #find emails
  # find the parent nodes with html_nodes
  # then find the contact information from each parent using html_node
  email_nodes <- html_nodes(page, "div.grpl-grp") %>% html_node(".grpl-contact a") %>% html_text()
  df$emails <- email_nodes
  category_nodes <- html_nodes(page, "div.grpl-grp") %>% html_node(".grpl-type") %>% html_text()
  df$category <- category_nodes
  pic_nodes <- html_nodes(page, "div.grpl-grp") %>% html_node(".grpl-logo img") %>% html_attr("src")
  df$logo <- paste0("https://uws-community.symplicity.com/", pic_nodes)
  more_info_nodes <- html_nodes(page, ".grpl-moreinfo a") %>% html_attr("href")
  df$more_info <- paste0("https://uws-community.symplicity.com/", more_info_nodes)
  df$fb <- lapply(1:nrow(df), get_more_info)
  if(count != 44) {
    return(rbind(df, get_table(page %>% follow_link(css = ".paging_nav a:last-child"), count + 1)))
  } else {
    return(df)
  }
}
RSO_data <- get_table(page, 0)
as of now I'm getting an error:
Error in ".grpl-moreinfo > a:nth-child(" + position :
non-numeric argument to binary operator
As you can see, I'm attempting to follow the link at each element by using the get_more_info function and applying it to the number of elements on a page using lapply.
Is there better way to do this? What am I doing wrong?
I think your solution is easier than you think it is.
In line 4 you used
page <- follow_link(page, css = ".grpl-moreinfo > a:nth-child(" + position + ")")
where
css = ".grpl-moreinfo > a:nth-child(" + position + ")"
In R you do not concatenate character strings with "+"; i.e. it does not work to write
"He" + "llo"
Try it again using paste('He', 'llo', sep = '') or paste0('He', 'llo').
Next time, try looking at the error message itself. It very often tells you exactly where the error is coming from.
edit:
If you want to use it like in Python you can write your own function like that:
`+` <- function(x, y){
  return(paste0(x, y))
}
I wouldn't recommend it, but it's possible.
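Applied to the original get_more_info(), the selector would be built with paste0(). A rough sketch follows; it also assigns the followed page to sub_page (which the original snippet referenced without creating) and uses @id rather than #id in the XPath:
get_more_info <- function(position) {
  # build the css selector with paste0() instead of "+"
  sub_page <- follow_link(page, css = paste0(".grpl-moreinfo > a:nth-child(", position, ")"))
  fb <- html_node(sub_page, xpath = '//*[@id="dnf_class_values_student_group__facebook__widget"]') %>%
    html_text()
  fb
}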

Rvest, looping through elements on a page in order to follow a link at each element?

So I'm trying to scrape data from a site that contains club data for clubs at my school. I've got a good script going that scrapes the surface-level data from the site; however, I can get more data by clicking the "more information" link at each club, which leads to the club's profile page. I would like to scrape the data from that page (specifically the Facebook link).
Below you'll see my current attempt at this.
url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
page <- html_session(url)
get_table <- function(page, count) {
  #find group names
  name_text <- html_nodes(page, ".grpl-name a") %>% html_text()
  df <- data.frame(name_text, stringsAsFactors = FALSE)
  #find text description
  desc_text <- html_nodes(page, ".grpl-purpose") %>% html_text()
  df$desc_text <- trimws(desc_text)
  #find emails
  # find the parent nodes with html_nodes
  # then find the contact information from each parent using html_node
  email_nodes <- html_nodes(page, "div.grpl-grp") %>% html_node(".grpl-contact a") %>% html_text()
  df$emails <- email_nodes
  category_nodes <- html_nodes(page, "div.grpl-grp") %>% html_node(".grpl-type") %>% html_text()
  df$category <- category_nodes
  pic_nodes <- html_nodes(page, "div.grpl-grp") %>% html_node(".grpl-logo img") %>% html_attr("src")
  df$logo <- paste0("https://uws-community.symplicity.com/", pic_nodes)
  more_info_nodes <- html_nodes(page, ".grpl-moreinfo a") %>% html_attr("href")
  df$more_info <- paste0("https://uws-community.symplicity.com/", more_info_nodes)
  sub_page <- page %>% follow_link(css = ".grpl-moreinfo a")
  df$fb <- html_node(sub_page, xpath = '//*[#id="dnf_class_values_student_group__facebook__widget"]') %>% html_text()
  if(count != 44) {
    return(rbind(df, get_table(page %>% follow_link(css = ".paging_nav a:last-child"), count + 1)))
  } else {
    return(df)
  }
}
RSO_data <- get_table(page, 0)
The current error I'm getting is:
Error in `$<-.data.frame`(`*tmp*`, "logo", value = "https://uws-community.symplicity.com/") :
replacement has 1 row, data has 0
I know I need to make a function that will go through each element and follow the link, then mapply that function to the dataframe df. However I don't know how I'd go about making that function so that it would work correctly.
Your error says that you are trying to combine two incompatible dimensions: the replacement value has one row but your data frame has zero rows. Try adding page <- html_session(url) inside your function.
This is a reproducible example of your error message:
x = data.frame()
x[1] <- c(1)
I haven't checked your whole code, but the error is in there; go through it step by step. You will find the place where you've created an empty data.frame and then tried to assign a value to it.
Good luck
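In this scraper a zero-row data.frame usually means html_nodes() matched nothing on a page, so one hypothetical guard (my addition, not part of the answer above) is to bail out before any columns are assigned:
get_table_safe <- function(page, count) {
  name_text <- html_nodes(page, ".grpl-name a") %>% html_text()
  # nothing matched: return an empty frame instead of assigning into 0 rows
  if (length(name_text) == 0) {
    return(data.frame())
  }
  df <- data.frame(name_text, stringsAsFactors = FALSE)
  # ... fill in the remaining columns as in get_table() above ...
  df
}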
