Scraping pages with inconsistent lengths in dataframe - r

I want to scrape all the names from this page into a single tibble with three columns. My code only works if all the data is present, hence my error:
Error: Tibble columns must have consistent lengths, only values of length one are recycled:
* Length 20: Columns `huisarts`, `url`
* Length 21: Column `praktijk`
How can I let my code run and fill the tibble with NAs where the data isn't there?
My code for a pausing robot, later used in the scraper function:
pauzing_robot <- function(periods = c(0, 1)) {
  tictoc <- runif(1, periods[1], periods[2])
  cat(paste0(Sys.time()),
      "- Sleeping for ", round(tictoc, 2), "seconds\n")
  Sys.sleep(tictoc)
}
Scraper:
library(tidyverse)
library(rvest)
scrape_page <- function(pagina_nummer) {
  page <- read_html(paste0("https://www.zorgkaartnederland.nl/huisarts/pagina", pagina_nummer))
  pauzing_robot(periods = c(0, 1.5))
  tibble(
    huisarts = page %>%
      html_nodes(".media-heading.title.orange") %>%
      html_text() %>%
      str_trim(),
    praktijk = page %>%
      html_nodes(".location") %>%
      html_text() %>%
      str_trim(),
    url = page %>%
      html_nodes(".media-heading.title.orange") %>%
      html_nodes("a") %>%
      html_attr("href") %>%
      str_trim() %>%
      paste0("https://www.zorgkaartnederland.nl", .)
  )
}
The total number of pages is 445, but for example's sake I'm only scraping three:
huisartsen <- map_df(sample(1:3), scrape_page)
Page 2 seems to be the one with the inconsistent lengths, because this code works:
huisartsen <- map_df(3:4, scrape_page)
If possible with tidyverse code. Thanks in advance.

You need to retrieve the list of parent nodes
parents <- page %>% html_nodes("li.media")
Then parse each parent node with html_node() (singular).
tibble(
  huisarts = parents %>%
    html_node(".media-heading.title.orange") %>%
    html_text() %>%
    str_trim(),
  praktijk = parents %>%
    html_node(".location") %>%
    html_text() %>%
    str_trim(),
  url = parents %>%
    html_node(".media-heading.title.orange a") %>%
    html_attr("href") %>%
    str_trim() %>%
    paste0("https://www.zorgkaartnederland.nl", .)
)
The html_node() function will always return a value, even if it is just NA, so the three columns stay the same length.
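Putting the two pieces together, a minimal sketch of scrape_page() rewritten around the parent nodes could look like this (same selectors and pausing logic as above):
scrape_page <- function(pagina_nummer) {
  page <- read_html(paste0("https://www.zorgkaartnederland.nl/huisarts/pagina", pagina_nummer))
  pauzing_robot(periods = c(0, 1.5))
  # one parent node per listing; missing children become NA instead of being dropped
  parents <- page %>% html_nodes("li.media")
  tibble(
    huisarts = parents %>%
      html_node(".media-heading.title.orange") %>%
      html_text() %>%
      str_trim(),
    praktijk = parents %>%
      html_node(".location") %>%
      html_text() %>%
      str_trim(),
    url = parents %>%
      html_node(".media-heading.title.orange a") %>%
      html_attr("href") %>%
      str_trim() %>%
      paste0("https://www.zorgkaartnederland.nl", .)
  )
}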

Related

Extracting repeated class with rvest html_elements in R

How are you? I am trying to extract some info from this sports betting webpage using rvest. I asked a related question a few days ago and got almost 100% of the way to my goal. So far, and thanks to you, I have successfully extracted the title, the score and the time of the matches being played using the following code:
library(rvest)
library(tidyverse)
page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>%
read_html()
data=data.frame(
Titulo = page %>%
html_elements(".titulo") %>%
html_text(),
Marcador = page %>%
html_elements(".marcador") %>%
html_text(),
Tiempo = page %>%
html_elements(".marcador+ span") %>%
html_text() %>%
str_squish()
)
Now I want to get the repeated values; for example, if the country of the match is "Brasil", I want the data frame to show that the country is Brasil for every match in that category. So far I have only managed to extract all the countries individually. The same applies to the sport name and tournament.
Can you help me with that? Thanks in advance.
You could re-write your code to use separate functions that work with different levels of information. These can be called in a nested fashion, making the code easier to read.
Essentially, this uses nested map_dfr() calls to produce a single dataframe from functions working with lists at different levels within the DOM.
Below, you can think of it as an outer loop over sports, then an intermediate loop over countries, and an innermost loop over the events within a sport and country.
library(rvest)
library(tidyverse)

get_sport_info <- function(sport) {
  df <- map_dfr(sport %>% html_elements(".category"), get_play_info)
  df$sport <- sport %>%
    html_element(".sport-name") %>%
    html_text()
  return(df)
}

get_play_info <- function(play) {
  df <- map_dfr(play %>% html_elements(".event"), ~
    data.frame(
      titulo = .x %>% html_element(".titulo") %>% html_text(),
      marcador = .x %>% html_element(".marcador") %>% html_text(),
      tiempo = .x %>% html_element(".marcador + span") %>% html_text() %>% str_squish()
    ))
  df$country <- play %>%
    html_element(".category-name") %>%
    html_text()
  return(df)
}

page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>% read_html()
sports <- page %>% html_elements(".sport")
final <- map_dfr(sports, get_sport_info)

Trouble mapping a function to a list of scraped links using rvest

I am trying to apply a function that extracts a table from a list of scraped links. I am at the final stage where I am applying the get_injury_data function to the links - I have been having issues with successfully executing this. I get the following error:
Error in matrix(unlist(values), ncol = width, byrow = TRUE) :
'data' must be of a vector type, was 'NULL'
I wonder if anyone can help me spot where I am going wrong. The code is as follows:
library(tidyverse)
library(rvest)

# create a function to grab the team links
get_team_links <- function(url){
  url %>%
    read_html() %>%
    html_nodes('td.hauptlink a') %>%
    html_attr('href') %>%
    .[. != '#'] %>% # remove rows with the # string
    paste0('https://www.transfermarkt.com', .) %>% # paste the website domain onto the url strings
    unique() %>% # keep only unique links
    as_tibble() %>% # turn the strings into a tibble dataset
    rename("links" = "value") %>% # rename the value column
    filter(!grepl('profil', links)) %>% # remove the player links included
    filter(!grepl('spielplan', links)) %>% # remove the links to additional team pages included
    mutate(links = gsub("startseite", "kader", links)) # change the link to go to the detailed page
}

# create a function to grab the player links
get_player_links <- function(url){
  url %>%
    read_html() %>%
    html_nodes('td.hauptlink a') %>%
    html_attr('href') %>%
    .[. != '#'] %>% # remove rows with the # string
    paste0('https://www.transfermarkt.com', .) %>% # paste the website domain onto the url strings
    unique() %>% # keep only unique links
    as_tibble() %>% # turn the strings into a tibble dataset
    rename("links" = "value") %>% # rename the value column
    filter(grepl('profil', links)) %>% # keep only the player links
    mutate(links = gsub("profil", "verletzungen", links)) # change the link to go to the injury page
}

# create a function to get the injury dataset
get_injury_data <- function(url){
  url %>%
    read_html() %>%
    html_nodes('#yw1') %>%
    html_table()
}
# get team links and save them as team_links
team_links <- get_team_links('https://www.transfermarkt.com/premier-league/startseite/wettbewerb/GB1')
# get player links by mapping the function onto the team_links dataset
# and then unnest the list of lists into a long list
player_injury_links <- team_links %>%
  mutate(links = map(team_links$links, get_player_links)) %>%
  unnest(links)
# using the player_injury_links list, create a dataset by web scraping the player injury pages
player_injury_data <- map(player_injury_links$links, get_injury_data)
Solution
So the issue that I was having was that some of the links that I was scraping did not have any data.
To overcome this issue, I used the possibly() function from the purrr package. This helped me create a new, error-free function.
The corrected version of the line that was giving me trouble is as follows:
player_injury_data <- player_injury_links %>%
  purrr::map(., purrr::possibly(get_injury_data, otherwise = NULL, quiet = TRUE))
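A possible follow-up step (not part of the original solution; it assumes the per-player tables share the same columns) is to drop the NULL results that possibly() returns for the empty links and bind the remaining tables:
# hypothetical post-processing: html_table() returns a list of tables per page,
# so keep the first table from each successful scrape, drop the NULL/empty
# results from failed pages, and row-bind the rest
player_injury_tables <- player_injury_links$links %>%
  purrr::map(purrr::possibly(get_injury_data, otherwise = NULL, quiet = TRUE)) %>%
  purrr::compact() %>% # remove NULL and empty results
  purrr::map(1) %>%    # take the single table returned by html_table()
  dplyr::bind_rows()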

Web Scraping: R rvest html_nodes return character(0)

Below is the code. The strange thing is that I could run it without problems yesterday, but today it always returns character(0). After checking through it, I found it's because of the html_nodes line.
I tried replacing ".photo-cards li article" with other nodes, but it's still not working.
Has anybody run into the same problem and fixed it? Thank you in advance for your help!
library(tidyverse)
library(rvest)
links <- sprintf("https://www.zillow.com/sacramento-ca/%d_p", 1:11)
results <- map(links, ~ {
  # http://selectorgadget.com/
  # <body
  # class="photo-cards
  houses <- read_html(.x) %>%
    html_nodes(".photo-cards li article")
  z_id <- houses %>%
    html_attr("id")
  address <- houses %>%
    html_node(".list-card-addr") %>%
    html_text()
  price <- houses %>%
    html_node(".list-card-price") %>%
    html_text() %>%
    readr::parse_number()
  params <- houses %>%
    html_node(".list-card-info") %>%
    html_text()
  # number of bedrooms
  beds <- params %>%
    str_extract("\\d+(?=\\s*bds)") %>%
    as.numeric()
  # number of bathrooms
  baths <- params %>%
    str_extract("\\d+(?=\\s*ba)") %>%
    as.numeric()
  # total square footage
  house_a <- params %>%
    str_extract("[0-9,]+(?=\\s*sqft)") %>%
    str_replace(",", "") %>%
    as.numeric()
  tibble(price = price, beds = beds, baths = baths, house_area = house_a)
}) %>%
  bind_rows(.id = 'page_no')

Rvest scraping google news with different number of rows

I am using rvest to scrape Google News.
However, I encounter missing values in the "Time" element from time to time with different keywords. Since those values are missing, I end up with a "different number of rows" error for the data frame of the scraping results.
Is there any way to fill in NA for these missing values?
Below is the example of the code I am using.
html_dat <- read_html(paste0("https://news.google.com/search?q=", Search, "&hl=en-US&gl=US&ceid=US%3Aen"))

dat <- data.frame(Link = html_dat %>%
                    html_nodes('.VDXfz') %>%
                    html_attr('href')) %>%
  mutate(Link = gsub("./articles/", "https://news.google.com/articles/", Link))

news_dat <- data.frame(
  Title = html_dat %>%
    html_nodes('.DY5T1d') %>%
    html_text(),
  Link = dat$Link,
  Description = html_dat %>%
    html_nodes('.Rai5ob') %>%
    html_text(),
  Time = html_dat %>%
    html_nodes('.WW6dff') %>%
    html_text()
)
Without knowing the exact page you were looking at, I tried the main Google News page.
In rvest, html_node (without the s) will always return a value, even if it is NA. Therefore, in order to keep the vectors the same length, one needs to find the common parent node for all of the desired data nodes, and then parse the desired information from each one of those parent nodes.
Assuming the Title node is the most complete, I went up one level with xml_parent() and attempted to retrieve the same number of description nodes; this didn't work. I then tried two levels up using xml_parent() %>% xml_parent(), and this seems to work.
library(rvest)
url <-"https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en"
html_dat <- read_html(url)
Title = html_dat %>% html_nodes('.DY5T1d') %>% html_text()
# Link = dat$Link
Link = html_dat %>% html_nodes('.VDXfz') %>% html_attr('href')
Link <- gsub("./articles/", "https://news.google.com/articles/",Link)
#Find the common parent node
#(this was trial and error) Tried the parent then the grandparent
Titlenodes <- html_dat %>% html_nodes('.DY5T1d') %>% xml_parent() %>% xml_parent()
Description = Titlenodes %>% html_node('.Rai5ob') %>% html_text()
Time = Titlenodes %>% html_node('.WW6dff') %>% html_text()
answer <- data.frame(Title, Time, Description, Link)

scraping Q&A works fine except when there's more than one page of answers for one post

The following code scrapes all questions and answers with their authors and dates, but I cannot figure out how to also scrape the answers that take up more than one page, for example for the second question here:
https://www.healthboards.com/boards/aspergers-syndrome/index2.html
Asperger's and talking to yourself
The answers are on 2 pages: 15 on the first page and 3 on the second; I'm only getting the answers on the first page.
library(rvest)
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
library(RCurl)
library(xlsx)
#install.packages("xlsx")
# Scrape thread titles, thread links, authors and number of views
url <- "https://www.healthboards.com/boards/aspergers-syndrome/index2.html"
h <- read_html(url)
threads <- h %>%
  html_nodes("#threadslist .alt1 div > a") %>%
  html_text()
threads
thread_links <- h %>%
  html_nodes("#threadslist .alt1 div > a") %>%
  html_attr(name = "href")
thread_links
thread_starters <- h %>%
  html_nodes("#threadslist .alt1 div.smallfont") %>%
  html_text() %>%
  str_replace_all(pattern = "\t|\r|\n", replacement = "")
thread_starters
views <- h %>%
  html_nodes(".alt2:nth-child(6)") %>%
  html_text() %>%
  str_replace_all(pattern = ",", replacement = "") %>%
  as.numeric()
# Custom functions to scrape author IDs and posts
scrape_posts <- function(link){
  read_html(link) %>%
    html_nodes(css = ".smallfont~ hr+ div") %>%
    html_text() %>%
    str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
    str_trim()
}
scrape_dates <- function(link){
  read_html(link) %>%
    html_nodes(css = "table[id^='post'] td.thead:first-child") %>%
    html_text() %>%
    str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
    str_trim()
}
scrape_author_ids <- function(link){
  h <- read_html(link) %>%
    html_nodes("div")
  id_index <- h %>%
    html_attr("id") %>%
    str_which(pattern = "postmenu")
  h %>%
    `[`(id_index) %>%
    html_text() %>%
    str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
    str_trim()
}
htmls <- map(thread_links, getURL)
# Create master dataset
master_data <-
  tibble(threads, thread_starters, thread_links) %>%
  mutate(
    post_author_id = map(htmls, scrape_author_ids),
    post = map(htmls, scrape_posts),
    fec = map(htmls, scrape_dates)
  ) %>%
  select(threads:post_author_id, post, thread_links, fec) %>%
  unnest()
master_data$thread_starters
threads
post
titles <- master_data$threads
thread_starters <- master_data$thread_starters
# views <- master_data$views
post_author <- master_data$post_author_id
post <- master_data$post
fech <- master_data$fec
employ.data <- data.frame(titles, thread_starters, post_author, post, fech)
write.xlsx(employ.data, "C:/2.xlsx")
I can't figure out how to also include the second page.
Taking a quick look at your code and the website, there is a td of class vbmenu_control which holds the number of pages (in your case, "page 2 of 2"). You could use some simple regex such as:
a <- "page 2 of 2"
b <- as.numeric(gsub("page \\d+ of ", "", a))
> b
[1] 2
Then add a conditional check on b > 1. If this is TRUE, you can loop-scrape through the links ...-talking-yourself-i.html, with i taking the values from 1 up to the number of pages, as in the sketch below.
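Here is a rough sketch of that loop. The td.vbmenu_control selector comes from the note above, but the "-i.html" URL pattern for the later pages is an assumption and may need adjusting to the site's real pagination links:
# sketch only: read the page count from the thread's first page, build one link
# per page (the "-i.html" suffix is assumed), then reuse scrape_posts() on each page
scrape_thread_posts <- function(link) {
  first_page <- read_html(link)
  page_info <- first_page %>%
    html_nodes("td.vbmenu_control") %>%
    html_text() %>%
    str_subset(regex("page \\d+ of \\d+", ignore_case = TRUE))
  n_pages <- if (length(page_info) == 0) 1 else
    as.numeric(gsub(".*of\\s*", "", page_info[1]))
  page_links <- c(link,
                  if (n_pages > 1) paste0(sub("\\.html$", "", link), "-", 2:n_pages, ".html"))
  unlist(map(page_links, scrape_posts))
}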
