How to extract desired data from rvest::html_text - web-scraping

I am having trouble extracting the price element from the website:
"https://www.eventbrite.com/" using rvest
I have located the selector with Select Gadget and have the following minimal selector ".eds-l-mar-top-1" which I have used to locate the price. I have tried saving the xml data as a dataframe but I get the following error message:
Error in as.data.frame.default(page_html) :
cannot coerce class ‘c("xml_document", "xml_node")’ to a data.frame
I have tried to filter the price with:
price <- page_html %>% html_nodes('js-display-price') %>% html_text()
but price is empty.
getYear = "2019"
getWeek = "31"
base_url = "https://www.eventbrite.com/"
query_params = list(yr=getYear, wk=getWeek)
resp <- GET(url=base_url, query=query_params)
page_html <- read_html(resp)
# price included in the details of the following tag
page_html %>%
html_nodes(".eds-l-mar-top-1") %>%
html_text(trim = TRUE)
I would like to extract the following data:
Name and Date of Event and price

I see content which is dynamically loaded but is present within a javascript object elsewhere in the response. You can regex out the object and handle with a json parser.
library(httr)
library(rvest)
library(stringr)
getYear = "2019"
getWeek = "31"
base_url = "https://www.eventbrite.com/"
query_params = list(yr=getYear, wk=getWeek)
resp <- GET(url=base_url, query=query_params)
r <- read_html(resp) %>%
html_nodes('body') %>%
html_text() %>%
toString()
x <- str_match_all(r,'window\\.__SERVER_DATA__ = (.*);')
json <- jsonlite::fromJSON(x[[1]][,2])
print(json$suggestions$events$ticket_availability)
print(json$suggestions$events)

Related

Rvest Pulls Empty Tables

The site I use to scrape data has changed and I'm having issues pulling the data into table format. I used two different types of codes below trying to get the tables, but it is returning blanks instead of tables.
I'm a novice in regards to scraping and would appreciate the expertise of the group. Should I look for other solutions in rvest, or try to learn a program like rSelenium?
https://www.pgatour.com/stats/detail/02675
Scrape for Multiple Links
library("dplyr")
library("purr")
library("rvest")
df23 <- expand.grid(
stat_id = c("02568","02674", "02567", "02564", "101")
) %>%
mutate(
links = paste0(
'https://www.pgatour.com/stats/detail/',
stat_id
)
) %>%
as_tibble()
#replaced tournament_id with stat_id
get_info <- function(link, stat_id){
data <- link %>%
read_html() %>%
html_table() %>%
.[[2]]
}
test_main_stats <- df23 %>%
mutate(tables = map2(links, stat_id, possibly(get_info, otherwise = tibble())))
test_main_stats <- test_main_stats %>%
unnest(everything())
Alternative Code
url <- read_html("https://www.pgatour.com/stats/detail/02568")
test1 <- url %>%
html_nodes(".css-8atqhb") %>%
html_table
This page uses javascript to create the table, so rvest will not directly work. But if one examines the page's source code, all of the data is stored in JSON format in a "<script>" node.
This code finds that node and converts from JSON to a list. The variable is the main table but there is a wealth of other information contained in the JSON data struture.
#read page
library(rvest)
page <- read_html("https://www.pgatour.com/stats/detail/02675")
#find the script with the correct id tage, strip the html code
datascript <- page %>% html_elements(xpath = ".//script[#id='__NEXT_DATA__']") %>% html_text()
#convert from JSON
output <- jsonlite::fromJSON(datascript)
#explore the output
str(output)
#get the main table
answer <-output$props$pageProps$statDetails$rows

R: webscraping returns no applicable method for 'xml_find_all' applied to an object of class "character"?

I'm using RSelenium and purrr functions to generate a df with all the products in this page and their prices:
https://www.lacuracao.pe/curacao/tv-y-audio/televisores
I'm getting this error, why?
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "character"
Code:
library(RSelenium)
library(rvest)
library(dplyr)
library(stringr)
library(purrr)
#start RSelenium
rD <- rsDriver(port = 4560L, browser = "chrome", version = "3.141.59", chromever = "93.0.4577.63",
geckover = "latest", iedrver = NULL, phantomver = "2.1.1",
verbose = TRUE, check = TRUE)
remDr <- rD[["client"]]
Sys.sleep(10)
tvs_url <- "https://www.lacuracao.pe/curacao/tv-y-audio/televisores"
remDr$navigate(tvs_url)
Sys.sleep(10)
#scroll down 20 times, waiting for the page to load at each time
for(i in 1:20){
remDr$executeScript(paste("scroll(0,",i*10000,");"))
Sys.sleep(5)
}
h<-remDr$getPageSource()
df <- map_dfr(h %>%
map(~ .x %>%
html_nodes("div.product")), ~
data.frame(
periodo = lubridate::year(Sys.Date()),
fecha = Sys.Date(),
ecommerce = "lacuracao",
producto = .x %>% html_node(".product_name") %>% html_text(),
precio.antes = .x %>% html_node('.old-price') %>% html_text(),
precio.actual = .x %>% html_node('#offerPriceValue') %>% html_text()
))
Update 1:
I've changed h<-remDr$getPageSource() to h<-remDr$getPageSource()[[1]] and now class(h) returns character.
Update 2:
Tried:
h<-remDr$getPageSource()[[1]]
hh <- h %>% read_html() %>% html_elements("div.product")
class(hh) #[1] "xml_nodeset"
But getting this when trying to form the df:
Error in data.frame(periodo = lubridate::year(Sys.Date()), fecha = Sys.Date(), :
arguments imply differing number of rows: 1, 0
Use remDr$getPageSource()[[1]] to get the actual document.
You then need to pipe that to your DOM parser i.e. remDr$getPageSource()[[1]] %>% read_html() and continue on as before i.e. ...%>% html_elements(.....).
RSelenium has its own methods for selecting elements via the Webdriver instance e.g. remDr$findElement("css", "body"). In your case, you are choosing to transfer the html across into something which you can call rvest's html_nodes() on i.e.
either a document, a node set or a single node.. As the transfer is html, then read_html() is needed to generate a document for parsing.
The error inside the attempt to form a data.frame call is because you need to implement handling of missing child nodes i.e. where certain prices are missing.

Need help scraping a big archive

For a schoolproject i have to scrape a website which isn't a problem. But for it to be called BigData i wanted to scrape the whole archive(which is the past 5 years). The only thing that changes in the url is the date at the end of the url but i don't know how to write a script that changes only the date at the end.
The website I'm using is this: https://www.ongelukvandaag.nl/archief/ .
And the dates i need are from 01-01-2015 until 24-09-2020. The first part of the code i already figured out and I'm able to scrape 1 page. I'm a beginner at using R and would like to know if anyone could help me. The code is shown below. Thanks in advance!
This is what i got so far and the errors are underneath the code.
install.packages("XML")
install.packages("reshape")
install.packages("robotstxt")
install.packages("Rcrawler")
install.packages("RSelenium")
install.packages("devtools")
install.packages("exifr")
install.packages("Publish")
devtools::install_github("r-lib/xml2")
library(rvest)
library(dplyr)
library(xml)
library(stringr)
library(jsonlite)
library(xml12)
library(purrr)
library(tidyr)
library(reshape)
library(XML)
library(robotstxt)
library(Rcrawler)
library(RSelenium)
library(ps)
library(devtools)
library(exifr)
library(Publish)
#Create an url object
url<-"https://www.ongelukvandaag.nl/archief/%d "
#Verify the web can be scraped
paths_allowed(paths = c(url))
#Obtain the links for every day from 2015 to 2020
map_df(2015:2020, function(i){
page<-read_html(sprintf(url,i))
data.frame(Links = html_attr(html_nodes(page, ".archief a"),"href"))
}) -> Links %>%
Links$Links<-paste("https://www.ongelukvandaag.nl/",Links$Links,sep = "")
#Scrape what you want from each link:
d<- map(Links$Links, function(x) {
Z <- read_html(x)
Date <- Z %>% html_nodes(".text-muted") %>% html_text(trim = TRUE) # Last update
All_title <- Z %>% html_nodes("h2") %>% html_text(trim = TRUE) # Title
return(tibble(All_title,Date))
})
The errors i get:
Error in open.connection(x, "rb") : HTTP error 400.
in paste("https://www.ongelukvandaag.nl/", Links$Links, sep = "") : object 'Links' not found >
in map(Links$Links, function(x) { : object 'Links' not found
and the packages "xml12" & "xml" don't work in this version of RStudio
Take a look at my code and my comments:
library(purrr)
library(rvest) # don't load a lot of libraries if you don't need them
url <- "https://www.ongelukvandaag.nl/archief/"
bigdata <-
map_dfr(
2015:2020,
function(year){
year_pg <- read_html(paste0(url, year))
list_dates <- year_pg %>% html_nodes(xpath = "//div[#class='archief']/a") %>% html_text() # in case some dates are missing
map_dfr(
list_dates,
function(date) {
pg <- read_html(paste0(url, date))
items <- pg %>% html_nodes("div.full > div.row")
items <- items[sapply(items, function(x) length(x %>% html_node(xpath = "./descendant::h2"))) > 0] # drop NA items
data.frame(
date = date,
title = items %>% html_node(xpath = "./descendant::h2") %>% html_text(),
update = items %>% html_node(xpath = "./descendant::h4") %>% html_text(),
image = items %>% html_node(xpath = "./descendant::img") %>% html_attr("src")
)
}
)
}
)

Web Scraping in R using rvest and finding the html_note

I am trying to find the current html_note to fetch the replies count for each post in this forum: https://d.cosx.org/. I used CSS selector and it said .DiscussionListItem-count but it seems not working.
My code:
library(rvest)
library(tidyverse)
COS_link <- read_html("https://d.cosx.org/")
COS_link %>%
# The relevant tag
html_nodes(css = '.DiscussionListItem-count') %>%
html_text()
I would like to fetch the replies count, for example: 1k for 1st post and 30 for 2nd post. I am wondering if I miss something or anyone has a better idea?
You can use the API and parse the json response for the title and participantCount attributes
API endpoint returning that info is:
https://d.cosx.org/api
Substring the response to remove the trailing 0 and leading ac76 then parse with a json library of choice.
Less optimal is to regex out the json string from original url
library(rvest)
library(jsonlite)
library(stringr)
url <- "https://d.cosx.org/"
r <- read_html(url) %>%
html_nodes('body') %>%
html_text() %>%
toString()
x <- str_match_all(r,'flarum\\.core\\.app\\.load\\((.*)\\);')
json <- jsonlite::fromJSON(x[[1]][,2])
counts <- json$resources$attributes$participantCount
For those wishing to pair up the title with count and who don't have chinese settings a colleague helped me write the following:
library(rvest)
library(jsonlite)
library(stringr)
library(corpus)
url <- "https://d.cosx.org/"
r <- read_html(url) %>%
html_nodes('body') %>%
html_text() %>%
toString()
x <- str_match_all(r,'flarum\\.core\\.app\\.load\\((.*)\\);')
json <- jsonlite::fromJSON(x[[1]][,2])
titles <- json$resources$attributes$title
counts <- json$resources$attributes$participantCount
cf <- corpus_frame(name = titles, text = counts)
names(cf) <- c("titles", "counts")
print(cf[which(!is.na(cf$counts)),], 100)

Sreality.cz web scraping

I have tried scraping data from a real estate site, and arranging the data in a way that can then easily be filtered and checked using a spreadsheet. I’m actually a little embarrassed that i don’t move of this R code forward.
Now that i have all the links to the posts, i can not now loop through the previously compiled dataframe and get the details from all the URLs.
Could you just please help me with it? Thanks a lot.
#Loading the rvest package
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of
library(xml2)
complete <- data.frame()
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
URL.base <- "https://www.sreality.cz/hledani/prodej/byty?strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?stari=dnes&strana="
#"https://www.sreality.cz/hledani/prodej/byty/praha?stari=tyden&strana="
for (i in 1:10000) {
#Specifying the url for desired website to be scrapped
main_link<- paste0(URL.base, i)
# go to website
remDr$navigate(main_link)
# get page source and save it as an html object with rvest
main_page <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# get the data
name <- html_nodes(main_page, css=".name.ng-binding") %>% html_text()
locality <- html_nodes(main_page, css=".locality.ng-binding") %>% html_text()
norm_price <- html_nodes(main_page, css=".norm-price.ng-binding") %>% html_text()
sreality_url <- main_page %>% html_nodes(".title") %>% html_attr("href")
sreality_url2 <- sreality_url[c(4:24)]
name2 <- name[c(4:24)]
record <- data.frame(cbind(name2, locality, norm_price, sreality_url2))
complete <- rbind(complete, record)
}
# Write CSV in R
write.csv(complete, file = "MyData.csv")
I would do this differently:
I would create a function, say 'scraper', that groups up together all the scraping functions you have already defined, doing so I'll create a list with the str_c of all the possibile links (say 30), after that a simple lapply function. As it all said, I will not use Rselenium. (libraries: rvest , stringr , tibble, dplyr )
url = 'https://www.sreality.cz/hledani/prodej/byty?strana='
here it is the URL base, starting from here you should be able to replicate the URL strings for all the pages (1 to whichever) you are interested in (and for all the possible url, for praha, olomuc, ostrava etc ).
main_page = read_html('https://www.sreality.cz/hledani/prodej/byty?strana=')
here you create all the linnks according to the number of pages you want:
list.of.pages = str_c(url, 1:30)
then define a single function for all the single data you are interested, in this way you are more precise and your error debug is easier, as well as the data quality. (I assume your CSS selections are right, otherwise you will obtain empty obj)
for names
name = function(url) {
data = html_nodes(url, css=".name.ng-binding") %>%
html_text()
return(data)
}
for locality
locality = function(url) {
data = html_nodes(url, css=".locality.ng-binding") %>%
html_text()
return(data)
}
for normprice
normprice = function(url) {
data = html_nodes(url, css=".norm-price.ng-binding") %>%
html_text()
return(data)
}
for hrefs
sreality_url = function(url) {
data = html_nodes(url, css=".title") %>%
html_attr("href")
return(data)
}
those are the single fuctions (the CSS selection, even if i didnt test them, seem to be not correct to me, but this will give you the right framework to work on). After that combine them into a tibble obj
get.data.table = function(html){
name = name(html)
locality = locality(html)
normprice = normprice(html)
hrefs = sreality_url(html)
combine = tibble(adtext = name,
loc = locality,
price = normprice,
URL = sreality_url)
combine %>%
select(adtext, loc, price, URL) return(combine)
}
then the final scraper:
scrape.all = function(urls){
list.of.pages %>%
lapply(get.data.table) %>%
bind_rows() %>%
write.csv(file = 'MyData.csv')
}

Resources