I have a dataframe where one of the columns contains the links to webpages I want to scrape with rvest. I would like to download some links, store them in another column, and then download some text from them. I tried to do it using lapply, but at the second step I get Error in UseMethod("xml_find_all") : no applicable method for 'xml_find_all' applied to an object of class "function". Maybe the problem is that the first links are saved as a list. Do you know how I can solve it?
This is my MWE (in my full dataset I have around 5000 links, should I use Sys.sleep and how?)
library(rvest)
df <- structure(list(numeroAtto = c("2855", "2854", "327", "240", "82"
), testo = c("http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.2855.18PDL0127540",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.327.18PDL0003550",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.327.18PDL0003550",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.240.18PDL0007740",
"http://dati.camera.it/ocd/versioneTestoAtto.rdf/vta18_leg.18.pdl.camera.82.18PDL0001750"
)), row.names = c(NA, 5L), class = "data.frame")
df$links_text <- lapply(df$testo, function(x) {
page <- read_html(x)
links <- html_nodes(page, '.value:nth-child(8) .fixed') %>%
html_text(trim = T)
})
df$text <- lapply(df$links_text, function(x) {
page1 <- read_html(x)
links1 <- html_nodes(page, 'p') %>%
html_text(trim = T)
})
You want links1 <- html_nodes(page, 'p') to refer to page1, not page.
(Otherwise, since there is no object called page in the function environment, R tries to apply html_nodes to the utils function page.)
In terms of Sys.sleep, it is fairly optional. Check the page HTML and the user agreement to see whether there is anything prohibiting scraping; if the site is sensitive about it, scraping more gently might improve your chances of not getting blocked!
You can just include Sys.sleep(n) in the function where you create df$text. The value of n is up to you; I've had luck with 1-3 seconds, but it does make the run pretty slow!
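Putting both points together, a minimal sketch of the corrected second step (assuming each element of df$links_text holds a single URL) could look like this:
df$text <- lapply(df$links_text, function(x) {
  Sys.sleep(2)  # pause between requests to be gentle with the server
  page1 <- read_html(x)
  html_nodes(page1, 'p') %>%
    html_text(trim = TRUE)
})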
You can do this in a single sapply call and use tryCatch to handle errors.
library(rvest)
df$text <- sapply(df$testo, function(x) {
tryCatch({
x %>%
read_html() %>%
html_nodes('.value:nth-child(8) .fixed') %>%
html_text(trim = T) %>%
read_html %>%
html_nodes('p') %>%
html_text(trim = T) %>%
toString()
}, error = function(e) NA)
})
library(tidyverse)
library(rvest)
library(htmltools)
library(xml2)
library(dplyr)
#import using read_html
results <- read_html("https://www.artemis.bm/deal-directory/")
#Results are in a list - can look at list by running 'results' in the console
#The issuers' information extracted
issuers <- results %>% html_nodes("#table-deal a") %>% html_text()
cedent <- results %>% html_nodes("td:nth-child(2)") %>% html_text()
risks <- results %>% html_nodes("td:nth-child(3)") %>% html_text()
size <- results %>% html_nodes("td:nth-child(4)") %>% html_text()
date <- results %>% html_nodes("td:nth-child(5)") %>% html_text()
#This scrapes all of the links for each issuer page
url <- results %>% html_nodes("#table-deal a") %>% html_attr("href")
#getting data from within the links
get_placement = function(url_link) {
url_link = read_html("https://www.artemis.bm/deal-directory/cape-lookout-re-ltd-series-2022-1/")
issuer_page = read_html(url_link)
placement = issuer_page %>% html_nodes("#info-box li:nth-child(3)") %>%
html_text()
}
This code works and the last bit from get_placement gets the information I am after (the placement section) - whichever link I put in, it gives me the placement for that deal. However, when I try to loop it, it does not work.
#here is my issue
get_placement = function(url_link) {
issuer_page = read_html(url_link)
placement = issuer_page %>% html_nodes("#info-box li:nth-child(3)") %>%
html_text()
return(placement)
}
This only gives me one value, when I need the placement information from all 833 links.
issuer_placement = sapply(url, FUN = get_placement)
When I try to use sapply I get this message
Browse[1]> issuer_placement = sapply(url, FUN = get_placement)
Error during wrapup: no applicable method for 'read_xml' applied to an object of class "name"
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
and
function (con, open = "r", blocking = TRUE, ...)
.Internal(open(con, open, blocking))
This worked for me without any problem
issue_placement <- lapply(url, function(u) {
tryCatch(return(get_placement(u)),
error=function(e) return("Not retrieved - error"),
warning=function(w) return("Not retrieved - warning"))
})
When I pushed issue_placement into a data.table (see below), I found 330 unique results, and no errors/warnings
data.table::data.table(placement = unlist(issue_placement))[,.N, placement]
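If you also want to keep each placement next to the link it came from, a minimal sketch (assuming each link yields exactly one placement value, so unlist() keeps the lengths aligned) is:
placements <- data.frame(url = url,
                         placement = unlist(issue_placement),
                         stringsAsFactors = FALSE)
head(placements)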
I'm currently web scraping a news magazine, but unfortunately I don't have a clue how to set up a working queue. I can only scrape the content of all articles on one page, but I want a queue that automatically does the same thing for the rest of the articles.
library(rvest)
library(tidyverse)
library(data.table)
library(plyr)
library(writexl)
map_dfc(.x = c("em.entrylist__title", "time.entrylist__time"),
.f = function(x) {read_html("https://www.sueddeutsche.de/news/page/1?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&time=2020-07-19T00%3A00%2F2020-07-27T23%3A59&startDate=27.07.2020&endDate=01.08.2020") %>%
html_nodes(x) %>%
html_text()}) %>%
bind_cols(url = read_html("https://www.sueddeutsche.de/news/page/1?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&time=2020-07-19T00%3A00%2F2020-07-27T23%3A59&startDate=27.07.2020&endDate=01.08.2020") %>%
html_nodes("a.entrylist__link") %>%
html_attr("href")) %>%
setNames(nm = c("title", "time", "url")) -> temp
map_df(.x = temp$url[1:50],
.f = function(x){tibble(url = x,
text = read_html(x) %>%
html_nodes("#article-app-container > article > div.css-isuemq.e1lg1pmy0 > p:nth-child(n)") %>%
html_text() %>%
list
)}) %>%
unnest(text) -> foo
foo
X2 <- ddply(foo, .(url), summarize,
Xc=paste(text,collapse=","))
final <- merge(temp, X2, by="url")
In this case, I got 30 pages filled with articles, but my script only supports scraping one page.
The only thing that changes between the pages is the page number (https://www.sueddeutsche.de/news/**page/1**?search=...)
If you could give me a hint on how to include all pages into the queue at once, I would be more than grateful. Thanks a lot :)
How would a queue in dataframe form work for you?
The following suggestion is kept a little more generic, so it will work beyond this specific use case. You would be able to add more URLs to scrape as you go, but only new ones will be kept, thanks to dplyr::distinct.
(I've initialised the queue to hold the first 5 pages you want to scrape; you can add more right away, or dynamically if you find further links in the DOM...)
library(dplyr)
library(lubridate)
queue <- tibble(
url = paste0("https://www.sueddeutsche.de/news/page/", 1:5, "?search=Corona&sort=date&all%5B%5D=dep&all%5B%5D=typ&all%5B%5D=sys&time=2020-07-19T00%3A00%2F2020-07-27T23%3A59&startDate=27.07.2020&endDate=01.08.2020"),
scraped_time = lubridate::NA_POSIXct_
)
results <- list()
while(length(open_rows <- which(is.na(queue$scraped_time))) > 0) {
i <- open_rows[1]
url <- queue$url[i]
[...]
results[[url]] <- <YOUR SCRAPING RESULT>
queue$scraped_time[i] <- lubridate::now()
if (<MORE PAGES TO QUEUE>) {
queue <- queue %>%
tibble::add_row(url = c('www.spiegel.de', 'www.faz.de')) %>%
arrange(desc(scraped_time)) %>%
distinct(url, .keep_all = T)
}
}
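For illustration, here is an untested sketch of the filled-in loop, using the selectors from your question and skipping the optional add-more-pages step (it assumes the three selectors return the same number of nodes per page, just like your map_dfc call does):
results <- list()
while (length(open_rows <- which(is.na(queue$scraped_time))) > 0) {
  i <- open_rows[1]
  url <- queue$url[i]
  page <- read_html(url)  # rvest is already loaded in your script
  results[[url]] <- tibble(
    title = page %>% html_nodes("em.entrylist__title") %>% html_text(),
    time  = page %>% html_nodes("time.entrylist__time") %>% html_text(),
    link  = page %>% html_nodes("a.entrylist__link") %>% html_attr("href")
  )
  queue$scraped_time[i] <- lubridate::now()
  Sys.sleep(1)  # small pause between pages
}
all_articles <- bind_rows(results)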
I'm an R beginner and I'm trying to write a function to scrape all the song lyrics by a certain singer from a website, returning a tibble with the lyrics and the name of each song. I have already managed to get all the song links, but I'm stuck trying to write a function that actually gets the lyrics.
website in question is: https://www.letras.mus.br/belchior/44457/
selector for the song title: #js-lyric-cnt > article > div.cnt-head.cnt-head--l > div.cnt-head_title > h1
selector for the song lyrics: #js-lyric-cnt > article > div.cnt-letra-trad.g-pr.g-sp > div.cnt-letra.p402_premium
I wrote this function:
get_lyrics <- function(url){
url %>% read_html() %>%
um <- html_nodes('#js-lyric-cnt > article > div.cnt-letra-trad.g-pr.g-sp > div.cnt-letra.p402_premium')
um %>%
lyrics <- html_text()
url %>% read_html() %>%
dois <- html_nodes('#js-lyric-cnt > article > div.cnt-head.cnt-head--l > div.cnt-head_title > h1')
dois %>%
title <- html_text()
data_frame(title, lyrics)
}
But when I try to run I get:
get_lyrics('https://www.letras.mus.br/belchior/1391391/')
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "character"
I'm not sure what I can do to fix it so I appreciate the help.
You could shorten your selectors (generally faster and more stable). Call read_html only once, then work with the retrieved content. I assume (eek) you want a dataframe with one entry for the title and one corresponding entry for the lyrics. The lyrics are within p tags inside the parent element with class cnt-letra; furthermore, individual lyric lines are separated by br tags. In order to preserve a sense of the original line spacing when parsing to a single string, I add '\n' to account for these breaks.
I got the functions necessary to work around the lack of br handling in rvest from #rentrop here - though as that issue is quite old perhaps I have missed the addition of this feature?
Be careful about the sequencing you use when chaining methods to ensure the flow is as intended.
library(rvest)
library(magrittr)
html_text_collapse <- function(x, trim = FALSE, collapse = "\n"){
UseMethod("html_text_collapse")
}
html_text_collapse.xml_nodeset <- function(x, trim = FALSE, collapse = "\n"){
vapply(x, html_text_collapse.xml_node, character(1), trim = trim, collapse = collapse)
}
html_text_collapse.xml_node <- function(x, trim = FALSE, collapse = "\n"){
paste(xml2::xml_find_all(x, ".//text()"), collapse = collapse)
}
get_lyrics <- function(url){
page <- read_html(url)
lyrics <- toString(page %>% html_nodes('.cnt-letra p') %>% html_text_collapse)
title <- page %>% html_node('.cnt-head_title') %>% html_text()
return(data.frame(title, lyrics))
}
get_lyrics('https://www.letras.mus.br/belchior/44457/')
If the goal is to just get the lyrics you can use the genius package.
genius::genius_lyrics("Belchior", "Na Hora do Almoco") will fetch the lyrics.
I have tried scraping data from a real estate site and arranging the data in a way that can then easily be filtered and checked in a spreadsheet. I'm actually a little embarrassed that I can't move this R code forward.
Now that I have all the links to the posts, I cannot work out how to loop through the previously compiled dataframe and get the details from all the URLs.
Could you please help me with it? Thanks a lot.
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the page
library(xml2)
complete <- data.frame()
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
URL.base <- "https://www.sreality.cz/hledani/prodej/byty?strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?stari=dnes&strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?stari=tyden&strana="
for (i in 1:10000) {
#Specifying the url for desired website to be scrapped
main_link<- paste0(URL.base, i)
# go to website
remDr$navigate(main_link)
# get page source and save it as an html object with rvest
main_page <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# get the data
name <- html_nodes(main_page, css=".name.ng-binding") %>% html_text()
locality <- html_nodes(main_page, css=".locality.ng-binding") %>% html_text()
norm_price <- html_nodes(main_page, css=".norm-price.ng-binding") %>% html_text()
sreality_url <- main_page %>% html_nodes(".title") %>% html_attr("href")
sreality_url2 <- sreality_url[c(4:24)]
name2 <- name[c(4:24)]
record <- data.frame(cbind(name2, locality, norm_price, sreality_url2))
complete <- rbind(complete, record)
}
# Write CSV in R
write.csv(complete, file = "MyData.csv")
I would do this differently:
I would create a function, say 'scraper', that groups together all the scraping functions you have already defined. Then I'd build a list with str_c of all the possible links (say 30), and after that run a simple lapply over it. That said, I would not use RSelenium. (libraries: rvest, stringr, tibble, dplyr)
url = 'https://www.sreality.cz/hledani/prodej/byty?strana='
This is the base URL; starting from here you should be able to build the URL strings for all the pages (1 to however many) you are interested in (and for all the possible URLs: praha, olomouc, ostrava, etc.).
main_page = read_html('https://www.sreality.cz/hledani/prodej/byty?strana=')
Here you create all the links according to the number of pages you want:
list.of.pages = str_c(url, 1:30)
Then define a single function for each piece of data you are interested in; in this way you are more precise and error debugging is easier, as is checking data quality. (I assume your CSS selectors are right; otherwise you will obtain empty objects.)
for names
name = function(url) {
data = html_nodes(url, css=".name.ng-binding") %>%
html_text()
return(data)
}
for locality
locality = function(url) {
data = html_nodes(url, css=".locality.ng-binding") %>%
html_text()
return(data)
}
for normprice
normprice = function(url) {
data = html_nodes(url, css=".norm-price.ng-binding") %>%
html_text()
return(data)
}
for hrefs
sreality_url = function(url) {
data = html_nodes(url, css=".title") %>%
html_attr("href")
return(data)
}
Those are the individual functions (the CSS selectors, even if I didn't test them, don't look quite right to me, but this gives you the right framework to work with). After that, combine them into a tibble object:
get.data.table = function(html){
name = name(html)
locality = locality(html)
normprice = normprice(html)
hrefs = sreality_url(html)
combine = tibble(adtext = name,
loc = locality,
price = normprice,
URL = hrefs)
combine = combine %>%
select(adtext, loc, price, URL)
return(combine)
}
then the final scraper:
scrape.all = function(urls){
urls %>%
lapply(read_html) %>% # html_nodes() needs a parsed document, so parse each page first
lapply(get.data.table) %>%
bind_rows() %>%
write.csv(file = 'MyData.csv')
}
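A usage sketch, assuming the CSS selectors are correct and that these listing pages render their content without JavaScript (otherwise you would still need RSelenium to get the loaded HTML):
# quick sanity check on a single page before the full run
test_page <- read_html(list.of.pages[1])
get.data.table(test_page)
# full run (writes MyData.csv)
scrape.all(list.of.pages)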
So I'm trying to scrape data from a site that contains club data from clubs at my school. I've got a good script going that scrapes the surface level data from the site, however I can get more data by clicking the "more information" link at each club which leads to the club's profile page. I would like to scrape the data from that page (specifically the facebook link).
Below you'll see my current attempt at this.
url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
page <- html_session(url)
get_table <- function(page, count) {
#find group names
name_text <- html_nodes(page,".grpl-name a") %>% html_text()
df <- data.frame(name_text, stringsAsFactors = FALSE)
#find text description
desc_text <- html_nodes(page, ".grpl-purpose") %>% html_text()
df$desc_text <- trimws(desc_text)
#find emails
# find the parent nodes with html_nodes
# then find the contact information from each parent using html_node
email_nodes<-html_nodes(page, "div.grpl-grp") %>% html_node( ".grpl-contact a") %>% html_text()
df$emails<-email_nodes
category_nodes <- html_nodes(page, "div.grpl-grp") %>% html_node(".grpl-type") %>% html_text()
df$category<-category_nodes
pic_nodes <-html_nodes(page, "div.grpl-grp") %>% html_node( ".grpl-logo img") %>% html_attr("src")
df$logo <- paste0("https://uws-community.symplicity.com/", pic_nodes)
more_info_nodes <- html_nodes(page, ".grpl-moreinfo a") %>% html_attr("href")
df$more_info <- paste0("https://uws-community.symplicity.com/", more_info_nodes)
sub_page <- page %>% follow_link(css = ".grpl-moreinfo a")
df$fb <- html_node(sub_page, xpath = '//*[@id="dnf_class_values_student_group__facebook__widget"]') %>% html_text()
if(count != 44) {
return (rbind(df, get_table(page %>% follow_link(css = ".paging_nav a:last-child"), count + 1)))
} else{
return (df)
}
}
RSO_data <- get_table(page, 0)
The current error I'm getting is:
Error in `$<-.data.frame`(`*tmp*`, "logo", value = "https://uws-community.symplicity.com/") :
replacement has 1 row, data has 0
I know I need to make a function that will go through each element and follow the link, then mapply that function to the dataframe df. However I don't know how I'd go about making that function so that it would work correctly.
Your error says that you are trying to combine two different dimensions: the value you assign has one row, while your data frame has 0 rows. Add page <- html_session(url) inside your function.
This is a reproducible example of your error message:
x = data.frame()
x[1] <- c(1)
I haven't checked your code in detail, but the error is in there; go through it step by step and you will find the place where you've created an empty data.frame and then tried to assign a value to it.
good luck
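If you want get_table to fail more gracefully when a page comes back empty, an untested sketch (reusing the page and df objects from your function) is to check the length before building the data frame:
name_text <- html_nodes(page, ".grpl-name a") %>% html_text()
if (length(name_text) == 0) {
  # nothing matched: the session is probably not on the page you expect
  stop("No club entries found on this page")
}
df <- data.frame(name_text, stringsAsFactors = FALSE)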