Scraping through multiple websites - R

I am trying to get multiple tables and later transform them (after some manipulation) into one data frame in R. See the code below:
countries <- c("au","at","de","se","gb","us")
for (i in countries) {
  sides <- glue("https://www.beeradvocate.com/beer/top-rated/", i, .sep = "")
  html[i] <- read_html(sides)
  cont[i] <- html[i] %>%
    html_nodes("table") %>%
    html_table()
}
If I do so, I get the following error output:
number of items to replace is not a multiple of replacement length
Error in UseMethod("xml_find_all") :
  no applicable method for 'xml_find_all' applied to an object of class "list"
Can someone help me?
Thanks a lot!!!

require(tidyverse)
require(rvest)

path_base <- "https://www.beeradvocate.com/beer/top-rated/"
countries <- c("au","at","de","se","gb","us")
path <- paste0(path_base, countries)

html_files <- path %>%
  map(read_html)

html_files %>%
  map(html_node, css = "table") %>%
  map(html_table, header = TRUE, fill = TRUE)
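Since the goal is a single data frame, the list of tables can then be collapsed with bind_rows(), tagging each row with its country code. A minimal sketch, assuming the six tables share the same column layout (beer_tables and one_df are just illustrative names):
# Name the list by country so bind_rows() can add an identifier column
beer_tables <- path %>%
  set_names(countries) %>%
  map(read_html) %>%
  map(html_node, css = "table") %>%
  map(html_table, header = TRUE, fill = TRUE)

# One data frame, with a "country" column recording the source page
one_df <- bind_rows(beer_tables, .id = "country")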

Related

Why is my loop not working when I webscrape on R?

library(tidyverse)
library(rvest)
library(htmltools)
library(xml2)
library(dplyr)
#import using read_html
results <- read_html("https://www.artemis.bm/deal-directory/")
#Results are in a list - can look at list by running 'results' in the console
#The issuers' information extracted
issuers <- results %>% html_nodes("#table-deal a") %>% html_text()
cedent <- results %>% html_nodes("td:nth-child(2)") %>% html_text()
risks <- results %>% html_nodes("td:nth-child(3)") %>% html_text()
size <- results %>% html_nodes("td:nth-child(4)") %>% html_text()
date <- results %>% html_nodes("td:nth-child(5)") %>% html_text()
#This scrapes all of the links for each issuer page
url <- results %>% html_nodes("#table-deal a") %>% html_attr("href")
#getting data from within the links
get_placement = function(url_link) {
  url_link = read_html("https://www.artemis.bm/deal-directory/cape-lookout-re-ltd-series-2022-1/")
  issuer_page = read_html(url_link)
  placement = issuer_page %>% html_nodes("#info-box li:nth-child(3)") %>%
    html_text()
}
This code works, and the last bit from get_placement gets the information I am after (the placement section); whichever link I put in, it gives me the placement information for that link. However, when I try to loop it, it does not work.
#here is my issue
get_placement = function(url_link) {
  issuer_page = read_html(url_link)
  placement = issuer_page %>% html_nodes("#info-box li:nth-child(3)") %>%
    html_text()
  return(placement)
}
This only gives me one value, when I need the placement information from all 833 links:
issuer_placement = sapply(url, FUN = get_placement)
When I try to use sapply I get this message
Browse[1]> issuer_placement = sapply(url, FUN = get_placement)
Error during wrapup: no applicable method for 'read_xml' applied to an object of class "name"
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
and
function (con, open = "r", blocking = TRUE, ...)
.Internal(open(con, open, blocking))
This worked for me without any problem
issue_placement <- lapply(url, function(u) {
  tryCatch(return(get_placement(u)),
           error = function(e) return("Not retrieved - error"),
           warning = function(w) return("Not retrieved - warning"))
})
When I pushed issue_placement into a data.table (see below), I found 330 unique results and no errors/warnings:
data.table::data.table(placement = unlist(issue_placement))[,.N, placement]
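If the goal is one table per deal, the placement results can then sit alongside the columns scraped earlier. A rough sketch, assuming issuers, cedent, risks, size, date and issue_placement all ended up with one entry per deal (deals is just an illustrative name):
# Combine the per-column scrapes and the per-link placements into one data frame
deals <- data.frame(
  issuer    = issuers,
  cedent    = cedent,
  risks     = risks,
  size      = size,
  date      = date,
  placement = unlist(issue_placement),
  stringsAsFactors = FALSE
)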

R: Errors when webscraping across multiple tables with same URL

I'm fairly new to webscraping and having issues troubleshooting my code. At the moment I'm getting different errors every time and don't really know where to continue. Currently I'm looking into utilizing RSelenium, but I would greatly appreciate some advice and feedback on the code below.
I based my initial code on the following: R: How to web scrape a table across multiple pages with the same URL
library(xml2)
library(RCurl)
library(dplyr)
library(rvest)
i = 1
table = list()
for (i in 1:15) {
  data = ("https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/","?page=",i))
  page <- read_html(data)
  table1 <- page %>%
    html_nodes(xpath = "(//table)[2]") %>%
    html_table(header = T)
  i = i + 1
  table1[[1]][[7]] = as.integer(gsub(",", "", table1[[1]][[7]]))
  table = bind_rows(table, table1)
  print(i)
}
table$`ÅR` = as.Date(table$`ÅR`, format = "%Y")
Below are the errors I am receiving at the moment. I know it's a lot, but I assume some of them are a result of previous errors. Any help would be greatly appreciated!
i=1
table = list()
for (i in 1:15) {
data=("https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/","?page=",i))
Error: unexpected ',' in:
"for (i in 1:15) {
data=("https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/","
page <- read_html(data)
Error in UseMethod("read_xml") :
no applicable method for 'read_xml' applied to an object of class "function"
table1 <- page %>%
html_nodes(xpath = "(//table)[2]") %>%
html_table(header=T)
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "function"
i=i+1
table1[[1]][[7]]=as.integer(gsub(",", "",table1[[1]][[7]]))
Error in is.factor(x) : object 'table1' not found
table=bind_rows(table, table1)
Error in list2(...) : object 'table1' not found
print(i)}
Error: unexpected '}' in " print(i)}"
table$`ÅR`=as.Date(table$`ÅR`,format ="%Y")
The following code produces a data frame containing all the data you are seeking. Rather than using RSelenium, the code below fetches the data directly from the same API the site uses to populate the table, so you do not need to combine multiple pages:
library(tidyverse)
library(rvest)
library(jsonlite)

#### Get the number of items ####
url <- "https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/10/"
data <- jsonlite::fromJSON(url, flatten = TRUE)
totalItems <- data$TotalNumberOfItems

#### Get all of the items ####
allData <- paste0('https://www.forsvarsbygg.no/ListApi/ListContent/78635/SoldEstates/0/', totalItems, '/') %>%
  jsonlite::fromJSON(., flatten = TRUE) %>%
  .[1] %>%
  as.data.frame() %>%
  rename_with(~str_replace(., "ListItems.", ""), everything())
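For reference, the syntax error in the original loop came from building the URL without paste0(). A hedged sketch of that page-by-page approach, assuming the ?page= parameter from the question really does return static HTML containing the target table (the API route above sidesteps this entirely):
library(rvest)
library(dplyr)

tables <- list()
for (i in 1:15) {
  data <- paste0("https://www.forsvarsbygg.no/no/salg-av-eiendom/solgte-eiendommer/",
                 "?page=", i)
  page <- read_html(data)
  table1 <- page %>%
    html_nodes(xpath = "(//table)[2]") %>%
    html_table(header = TRUE)
  # Skip pages where the second <table> is not found
  if (length(table1) > 0) tables <- bind_rows(tables, table1[[1]])
}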

R: webscraping returns no applicable method for 'xml_find_all' applied to an object of class "character"?

I'm using RSelenium and purrr functions to generate a df with all the products on this page and their prices:
https://www.lacuracao.pe/curacao/tv-y-audio/televisores
I'm getting this error, why?
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "character"
Code:
library(RSelenium)
library(rvest)
library(dplyr)
library(stringr)
library(purrr)

# start RSelenium
rD <- rsDriver(port = 4560L, browser = "chrome", version = "3.141.59", chromever = "93.0.4577.63",
               geckover = "latest", iedrver = NULL, phantomver = "2.1.1",
               verbose = TRUE, check = TRUE)
remDr <- rD[["client"]]
Sys.sleep(10)

tvs_url <- "https://www.lacuracao.pe/curacao/tv-y-audio/televisores"
remDr$navigate(tvs_url)
Sys.sleep(10)

# scroll down 20 times, waiting for the page to load at each time
for (i in 1:20) {
  remDr$executeScript(paste("scroll(0,", i * 10000, ");"))
  Sys.sleep(5)
}

h <- remDr$getPageSource()

df <- map_dfr(h %>%
                map(~ .x %>%
                      html_nodes("div.product")), ~
                data.frame(
                  periodo = lubridate::year(Sys.Date()),
                  fecha = Sys.Date(),
                  ecommerce = "lacuracao",
                  producto = .x %>% html_node(".product_name") %>% html_text(),
                  precio.antes = .x %>% html_node('.old-price') %>% html_text(),
                  precio.actual = .x %>% html_node('#offerPriceValue') %>% html_text()
                ))
Update 1:
I've changed h<-remDr$getPageSource() to h<-remDr$getPageSource()[[1]] and now class(h) returns character.
Update 2:
Tried:
h<-remDr$getPageSource()[[1]]
hh <- h %>% read_html() %>% html_elements("div.product")
class(hh) #[1] "xml_nodeset"
But getting this when trying to form the df:
Error in data.frame(periodo = lubridate::year(Sys.Date()), fecha = Sys.Date(), :
arguments imply differing number of rows: 1, 0
Use remDr$getPageSource()[[1]] to get the actual document.
You then need to pipe that to your DOM parser, i.e. remDr$getPageSource()[[1]] %>% read_html(), and continue on as before, i.e. ... %>% html_elements(...).
RSelenium has its own methods for selecting elements via the WebDriver instance, e.g. remDr$findElement("css", "body"). In your case, you are choosing to transfer the HTML across into something you can call rvest's html_nodes() on, i.e. either a document, a node set, or a single node. As what is transferred is HTML, read_html() is needed to generate a document for parsing.
The error inside the data.frame() call arises because you need to handle missing child nodes, i.e. where certain prices are missing.
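One way to do that is a small helper that returns NA when a child node is missing, so every product still yields a one-row data frame. A sketch only; txt_or_na is a hypothetical helper and the selectors are the ones from the question:
# Return the node's text, or NA if the child node is absent
txt_or_na <- function(node, css) {
  out <- node %>% html_node(css) %>% html_text()
  if (length(out) == 0) NA_character_ else out
}

products <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_elements("div.product")

df <- purrr::map_dfr(products, ~ data.frame(
  periodo       = lubridate::year(Sys.Date()),
  fecha         = Sys.Date(),
  ecommerce     = "lacuracao",
  producto      = txt_or_na(.x, ".product_name"),
  precio.antes  = txt_or_na(.x, ".old-price"),
  precio.actual = txt_or_na(.x, "#offerPriceValue"),
  stringsAsFactors = FALSE
))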

Scraping links in df columns with rvest

I have a dataframe where one of the columns contains the links to webpages I want to scrape with rvest. I would like to download some links, store them in another column, and download some texts from them. I tried to do it using lapply but I get Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "function" at the second step. Maybe the problem could be that the first links are saved as a list. Do you know how I can solve it?
This is my MWE (in my full dataset I have around 5000 links, should I use Sys.sleep and how?)
library(rvest)
df <- structure(list(numeroAtto = c("2855", "2854", "327", "240", "82"
), testo = c("http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.2855.18PDL0127540",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.327.18PDL0003550",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.327.18PDL0003550",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.240.18PDL0007740",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.82.18PDL0001750"
)), row.names = c(NA, 5L), class = "data.frame")
df$links_text <- lapply(df$testo, function(x) {
  page <- read_html(x)
  links <- html_nodes(page, '.value:nth-child(8) .fixed') %>%
    html_text(trim = T)
})
df$text <- lapply(df$links_text, function(x) {
  page1 <- read_html(x)
  links1 <- html_nodes(page, 'p') %>%
    html_text(trim = T)
})
You want links1 <- html_nodes(page, 'p') to refer to page1, not page.
(Otherwise, since there is no object page in the function environment, it tries to apply html_nodes() to the utils function page.)
In terms of Sys.sleep, it is fairly optional. Check the page HTML and see whether there is anything in the code or user agreement prohibiting scraping. If so, scraping more gently might improve your chances of not getting blocked!
You can just include Sys.sleep(n) in your function where you create df$text, as in the sketch below. n is up to you; I've had luck with 1-3 seconds, but it does make things pretty slow!
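A minimal sketch with that pause added (and the page1 fix applied); the one-second delay is arbitrary:
df$text <- lapply(df$links_text, function(x) {
  Sys.sleep(1)                      # be gentle with the server
  page1 <- read_html(x)
  html_nodes(page1, 'p') %>%
    html_text(trim = TRUE)
})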
You may do this in a single sapply command and use tryCatch to handle errors.
library(rvest)

df$text <- sapply(df$testo, function(x) {
  tryCatch({
    x %>%
      read_html() %>%
      html_nodes('.value:nth-child(8) .fixed') %>%
      html_text(trim = T) %>%
      read_html() %>%
      html_nodes('p') %>%
      html_text(trim = T) %>%
      toString()
  }, error = function(e) NA)
})

Web Scraping: Xpath Code Returns "Script out of bounds message" in R

I'm trying to scrape a table in a for loop; however, the code returns the following error: "Error in Raw_Dividends[[1]] : subscript out of bounds". I have a suspicion that the XPath might not be correct, but I've tried many similar XPath expressions from the HTML and none have worked. Any suggestions would be appreciated. Code as below:
library(jsonlite)
library(rvest)

url = "https://www.dividenddata.co.uk/ftse-dividend-history.py?market=ftse100"
xpath = '/html/body/section/div[3]/div[1]/div/table'
url_html <- read_html(url)
Stock_Table <- url_html %>% html_nodes(xpath = xpath) %>% html_table(fill = T)
Stock_Table <- Stock_Table[[1]]
Stock_Table <- Stock_Table[, c(1, 2)]

Dividends <- data.frame()
for (i in 1:nrow(Stock_Table)) {
  url = paste0("https://www.dividenddata.co.uk/dividendhistory.py?epic=", Stock_Table[1,1])
  xpath = '/html/body/section/div[3]/div/div/div[3]/div[1]/div'
  Raw_Dividends <- url_html %>% html_nodes(xpath = xpath) %>%
    html_table(fill = T)
  Raw_Dividends <- Raw_Dividends[[1]]
  Raw_Dividends$Symbol <- rep(Stock_Table[1,i], nrow(Stock_Table))
  Dividends <- rbind(Dividends, Raw_Dividends)
}
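Two details in the loop stand out and may explain the subscript error: the per-stock URL is built but never read (the loop keeps querying url_html, the overview page, where that XPath finds nothing, so html_table() returns an empty list), and Stock_Table[1,1] / Stock_Table[1,i] use a fixed row instead of i. A hedged sketch of one possible fix, untested against the live site and assuming the dividend history is an ordinary <table> on each stock's page (epic and div_url are illustrative names):
Dividends <- data.frame()
for (i in 1:nrow(Stock_Table)) {
  epic <- Stock_Table[[1]][i]                      # ticker from the first column
  div_url <- paste0("https://www.dividenddata.co.uk/dividendhistory.py?epic=", epic)
  div_html <- read_html(div_url)                   # re-read the page for this stock
  Raw_Dividends <- div_html %>%
    html_nodes("table") %>%
    html_table(fill = TRUE)
  if (length(Raw_Dividends) == 0) next             # skip tickers with no table
  Raw_Dividends <- Raw_Dividends[[1]]
  Raw_Dividends$Symbol <- epic                     # recycle the ticker down the rows
  Dividends <- rbind(Dividends, Raw_Dividends)
}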
