R Web Scraping - data frame columns have different lengths

I am trying to create a data frame with the following variables. However, after using the SelectorGadget tool to determine the CSS selectors needed to scrape this information, the vectors come back with different lengths, even after copying the selectors straight from the HTML source code. If done correctly, the table should have 34 rows. Here is my code and the resulting error:
library(rvest)
library(tidyverse)

womens_bb <- read_html("http://gomason.com/schedule.aspx?path=wbball")
womens_opponents <- womens_bb %>%
  html_nodes(".sidearm-schedule-game-opponent-name a") %>%
  html_text()
womens_locations <- womens_bb %>%
  html_nodes(".sidearm-schedule-game-location span:nth-child(1)") %>%
  html_text()
womens_dates <- womens_bb %>%
  html_nodes(".sidearm-schedule-game-opponent-date span:nth-child(1)") %>%
  html_text()
womens_times <- womens_bb %>%
  html_nodes(".sidearm-schedule-game-opponent-date span:nth-child(2)") %>%
  html_text() %>%
  as.numeric()
womens_scores <- womens_bb %>%
  html_nodes("div.sidearm-schedule-game-result span:nth-child(3)") %>%
  html_text() %>%
  as.numeric()
womens_win_loss <- womens_bb %>%
  html_nodes(".text-italic span:nth-child(2)") %>%
  html_text() %>%
  str_replace("\\,", "")
womens_df <- data_frame(
  date = womens_dates, time = womens_times, opponent = womens_opponents,
  location = womens_locations, score = womens_scores, win_loss = womens_win_loss)
Error: Columns `date`, `time`, `opponent`, `score`, `win_loss` must be length 1 or 37, not 36, 36, 34, 34, 35
How can I resolve this issue?

I think there is an issue with an img tag in some entries. To avoid this, you can first gather the outer div for each game (36 of them when I run the script), loop over those, and extract each field inside the loop, with a small if check on the tags that come out looking weird:
womens_bb <- read_html("http://gomason.com/schedule.aspx?path=wbball")
divs <- womens_bb %>% html_nodes(".sidearm-schedule-game")
df <- data.frame()
for (div in divs) {
  womens_opponents <- div %>%
    html_nodes(".sidearm-schedule-game-opponent-name, .sidearm-schedule-game-opponent-name a") %>%
    html_text()
  womens_opponents <- gsub("\\s{2,}", "", womens_opponents[1])
  womens_locations <- div %>%
    html_nodes(".sidearm-schedule-game-location span:nth-child(1)") %>%
    html_text()
  womens_locations <- womens_locations[1]
  womens_dates <- div %>%
    html_nodes(".sidearm-schedule-game-opponent-date span:nth-child(1)") %>%
    html_text()
  womens_times <- div %>%
    html_nodes(".sidearm-schedule-game-opponent-date span:nth-child(2)") %>%
    html_text()
  womens_scores <- div %>%
    html_nodes("div.sidearm-schedule-game-result span:nth-child(3)") %>%
    html_text()
  if (length(womens_scores) == 0) womens_scores <- ""
  womens_win_loss <- div %>%
    html_nodes(".text-italic span:nth-child(2)") %>%
    html_text()
  womens_win_loss <- gsub("\\,", "", womens_win_loss)
  res <- c(date = womens_dates, time = womens_times, opponent = womens_opponents,
           location = womens_locations, score = womens_scores, win_loss = womens_win_loss)
  print(length(res))
  df <- rbind(df, res)
}
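A more compact variant of the same parent-node idea, as a sketch (not tested against the live page, reusing the asker's selectors): html_node (singular) returns exactly one result per parent node and yields NA where the child tag is missing, so every column is guaranteed to have one entry per game:
library(rvest)
library(dplyr)

womens_bb <- read_html("http://gomason.com/schedule.aspx?path=wbball")
games <- womens_bb %>% html_nodes(".sidearm-schedule-game")

# html_node() fills in NA for games missing a child tag,
# so all columns come out with length(games) entries
womens_df <- tibble(
  date     = games %>% html_node(".sidearm-schedule-game-opponent-date span:nth-child(1)") %>% html_text(trim = TRUE),
  time     = games %>% html_node(".sidearm-schedule-game-opponent-date span:nth-child(2)") %>% html_text(trim = TRUE),
  opponent = games %>% html_node(".sidearm-schedule-game-opponent-name a") %>% html_text(trim = TRUE),
  location = games %>% html_node(".sidearm-schedule-game-location span:nth-child(1)") %>% html_text(trim = TRUE),
  score    = games %>% html_node("div.sidearm-schedule-game-result span:nth-child(3)") %>% html_text(trim = TRUE),
  win_loss = games %>% html_node(".text-italic span:nth-child(2)") %>% html_text(trim = TRUE)
)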
Hope that helps,
Gottavianoni

Related

Received "Error in open.connection(x, 'rb') : HTTP error 404" after running a for-loop in R

While trying to scrape information from several links, I got the error: Error in open.connection(x, "rb") : HTTP error 404.
I feel like it has something to do with the first part of my for-loop, so I tried changing numbers from character to numeric, but that did not fix the problem. I also tried the advice here, but it caused more problems.
Can you spot where I went wrong?
library(rvest)
library(tidyverse)
pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')
get_links <- pageMen %>%
html_nodes('.categories-grid__category a') %>%
html_attr('href') %>%
paste0('https://www.bjjcompsystem.com', .)
# extract numerical part of link
numbers = str_sub(get_links, - 7, - 1)
numbers = as.numeric(numbers)
## create empty vector ----------------------------
master1.tree = data.frame()
## Create for loop ---------------------------------
for (i in length(numbers)){
  url <- read_html(paste0('https://www.bjjcompsystem.com/tournaments/1869/categories/', i))
  ageDivision <- url %>% html_nodes('.category-title__age-division') %>% html_text()
  gender <- url %>% html_nodes('.category-title__age-division+ .category-title__label') %>% html_text()
  matches = data.frame('division' = ageDivision, 'gender' = gender)
  master1.tree <- rbind(master1.tree, data.frame(matches))
}
I also ran the following, but it did not return a data frame of the scraped data; instead it just printed the results to the screen:
map_df(get_links, function(i){
  url <- read_html(i)
  matches <- data.frame(ageDivision <- url %>%
                          html_nodes('.category-title__age-division') %>% html_text(),
                        gender <- url %>%
                          html_nodes('.category-title__age-division+ .category-title__label') %>% html_text())
  master1.tree <- rbind(master1.tree, matches)
})
Here is an alternative to your code. First, it is not necessary to extract the numbers: you can loop directly over the vector get_links. Second, I use purrr::map_df for the looping part, which is more concise than a for loop; to this end I use a custom function that scrapes a single page. Finally, I use trim = TRUE in html_text to remove leading and trailing white space:
library(rvest)
library(tidyverse)
pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')
get_links <- pageMen %>%
html_nodes('.categories-grid__category a') %>%
html_attr('href') %>%
paste0('https://www.bjjcompsystem.com', .)
scrape_page <- function(url) {
  html <- read_html(url)
  data.frame(
    division = html %>% html_nodes('.category-title__age-division') %>% html_text(trim = TRUE),
    gender = html %>% html_nodes('.category-title__age-division+ .category-title__label') %>% html_text(trim = TRUE)
  )
}
master1.tree <- purrr::map_df(get_links[1:5], scrape_page)
master1.tree
#> division gender
#> 1 Master 1 Male
#> 2 Master 1 Male
#> 3 Master 1 Male
#> 4 Master 1 Male
#> 5 Master 1 Male
If you prefer to keep the for loop, the fix is to iterate over the category numbers themselves (for (i in numbers)) rather than over length(numbers), which is just a single value:
library(rvest)
library(tidyverse)
pageMen = read_html('https://www.bjjcompsystem.com/tournaments/1869/categories')
get_links <- pageMen %>%
html_nodes('.categories-grid__category a') %>%
html_attr('href') %>%
paste0('https://www.bjjcompsystem.com', .)
# extract numerical part of link
numbers = str_sub(get_links, - 7, - 1)
numbers = as.numeric(numbers)
## create empty vector ----------------------------
master1.tree = data.frame()
## Create for loop ---------------------------------
for (i in numbers){
  url <- read_html(paste0('https://www.bjjcompsystem.com/tournaments/1869/categories/', i))
  ageDivision <- url %>%
    html_nodes('.category-title__age-division') %>%
    html_text()
  gender <- url %>%
    html_nodes('.category-title__age-division+ .category-title__label') %>%
    html_text()
  matches = data.frame('division' = ageDivision, 'gender' = gender)
  master1.tree <- rbind(master1.tree, matches)
}
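Since the original error was an HTTP 404, it can also help to make the scraper robust to dead links. A sketch (assuming the scrape_page function defined above): purrr::possibly returns NULL for pages that fail to load, and map_df simply skips NULL results when binding:
safe_scrape <- purrr::possibly(scrape_page, otherwise = NULL)  # failed reads return NULL
master1.tree <- purrr::map_df(get_links, safe_scrape)          # NULLs are dropped when binding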

Column class issues between rvest and map_dfr during a web scrape

I'm currently scraping stats from a website, but on certain stat pages I hit a snag with the following prompt:
Error: Column avg can't be converted from numeric to character
I tried something like mutate(avg = avg %>% as.numeric), but then I get a message saying the column avg can't be found.
The issue in the code below occurs whenever I add stat_id 336 or 340. Any ideas on how to solve this?
library(dplyr)
library(tidyverse)
library(janitor)
library(rvest)
library(magrittr)
df <- expand.grid(
  tournament_id = c("t464", "t054", "t047"),
  stat_id = c("02564", "101", "102", "336", "340")
) %>%
  mutate(
    links = paste0(
      'https://www.pgatour.com/stats/stat.',
      stat_id,
      '.y2019.eon.',
      tournament_id,
      '.html'
    )
  ) %>%
  as_tibble()
# Function to get the table
get_info <- function(link, tournament) {
  link %>%
    read_html() %>%
    html_table() %>%
    .[[2]] %>%
    clean_names() %>%
    select(-rank_last_week) %>%
    mutate(rank_this_week = rank_this_week %>% as.character) %>%
    mutate(tournament)
}
# Retrieve the tables and bind them
test12 <- df %$%
  map2_dfr(links, tournament_id, get_info)
test12
You generally don't want to put a pipe inside a dplyr verb; at least, I have never seen that done before. It is also not clear why you need it in this example, as avg parses as numeric automatically. Try this instead:
# Function to get the table
get_info <- function(link, tournament_id) {
  data <- link %>%
    read_html() %>%
    html_table() %>%
    .[[2]] %>%
    clean_names() %>%
    select(-rank_last_week) %>%
    mutate(rank_this_week = as.integer(str_extract(rank_this_week, "\\d+")))
  # Some stat pages return these columns as numeric and others as character;
  # coerce them to character when present so the tables can be row-bound
  try(data <- mutate(data, avg = as.character(avg)), silent = TRUE)
  try(data <- mutate(data, total_distance_feet = as.character(total_distance_feet)), silent = TRUE)
  data
}
test12 <- df %>%
  mutate(tables = map2(links, tournament_id, get_info)) %>%
  tidyr::unnest(tables)
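An alternative that sidesteps the class clash entirely (a sketch, assuming all the stat tables share the same layout; get_info_chr is a hypothetical variant of get_info): coerce every column to character inside the scraper, bind the tables, and let readr::type_convert() re-parse the types afterwards:
# Hypothetical variant of get_info: every column comes back as character
get_info_chr <- function(link, tournament) {
  link %>%
    read_html() %>%
    html_table() %>%
    .[[2]] %>%
    clean_names() %>%
    select(-rank_last_week) %>%
    mutate(across(everything(), as.character), tournament = tournament)
}

test12 <- df %$%
  map2_dfr(links, tournament_id, get_info_chr) %>%
  readr::type_convert()  # re-parses character columns into numerics where possible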

Web scraping with Rvest -- Return NA if node is not found?

I am a bit stuck here. I would like to scrape data from a website and extract a few things like user ratings, comments, etc., adding the data to a data frame.
Below is the code I have so far:
# Read html and select the URLs for each game review.
library(rvest)
library(dplyr)
library(plyr)
# Read the webpage and the number of ratings.
getGame <- function(metacritic_game) {
  total_ratings <- metacritic_game %>%
    html_nodes("strong") %>%
    html_text()
  total_ratings <- ifelse(length(total_ratings) == 0, NA,
                          as.numeric(strsplit(total_ratings, " ")[[1]][1]))
  # Get the game title and the platform.
  game_title <- metacritic_game %>%
    html_nodes("h1") %>%
    html_text()
  game_platform <- metacritic_game %>%
    html_nodes(".platform a") %>%
    html_text()
  game_platform <- strsplit(game_platform, " ")[[1]][57:58]
  game_platform <- gsub("\n", "", game_platform)
  game_platform <- paste(game_platform[1], game_platform[2], sep = " ")
  game_publisher <- metacritic_game %>%
    html_nodes(".publisher a:nth-child(1)") %>%
    html_attr("href") %>%
    strsplit("/company/") %>%
    unlist()
  game_publisher <- gsub("\\W", " ", game_publisher)
  game_publisher <- strsplit(game_publisher, "\\t")[[2]][1]
  release_date <- metacritic_game %>%
    html_nodes(".release_data .data") %>%
    html_text()
  user_ratings <- metacritic_game %>%
    html_nodes("#main .indiv") %>%
    html_text() %>%
    as.numeric()
  user_name <- metacritic_game %>%
    html_nodes(".name a") %>%
    html_text()
  review_date <- metacritic_game %>%
    html_nodes("#main .date") %>%
    html_text()
  user_comment <- metacritic_game %>%
    html_nodes("#main .review_section .review_body") %>%
    html_text()
  record_game <- data.frame(game_title = game_title,
                            game_platform = game_platform,
                            game_publisher = game_publisher,
                            username = user_name,
                            ratings = user_ratings,
                            date = review_date,
                            comments = user_comment)
}
metacritic_home <- read_html("https://www.metacritic.com/browse/games/score/metascore/90day/all/filtered")
game_urls <- metacritic_home %>%
  html_nodes("#main .product_title a") %>%
  html_attr("href")
get100games <- function(game_urls) {
  data <- data.frame()
  for (i in 1:length(game_urls)) {
    metacritic_game <- read_html(paste0("https://www.metacritic.com",
                                        game_urls[i], "/user-reviews"))
    record_game <- getGame(metacritic_game)
    data <- rbind.fill(data, record_game)
    print(i)
  }
  data
}
df100games <- get100games(game_urls)
Some of the links, though, do not have any user reviews, so rvest cannot find the node and I get the following error: Error in data.frame(game_title = game_title, game_platform = game_platform, : arguments imply differing number of rows: 1, 0.
I have tried to include ifelse statements such as:
username = ifelse(length(user_name) != 0, user_name, NA),
ratings = ifelse(length(user_ratings) != 0, user_ratings, NA),
date = ifelse(length(review_date) != 0, review_date, NA),
comments = ifelse(length(user_comment) != 0, user_comment, NA))
However, the data frame then contains only one review per game instead of all the reviews. Any thoughts on this?
Thanks!
You can use the function operator possibly from the purrr package:
df100games <- purrr::map(game_urls, purrr::possibly(get100games, NULL)) %>%
  purrr::compact() %>%
  dplyr::bind_rows()
possibly() makes the wrapped function return NULL instead of throwing when a page has no reviews; compact() then drops the NULLs before bind_rows() stacks the rest. I believe this will return your desired output.
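As a toy illustration of the operator itself (a hypothetical example, not part of the original answer), possibly() wraps any function so that it returns the otherwise value instead of raising an error:
safe_log <- purrr::possibly(log, otherwise = NA_real_)
safe_log(10)   # 2.302585
safe_log("a")  # NA, instead of a "non-numeric argument" error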

rvest: how to get NA values from html_nodes when creating data tables

So I'm trying to make a data table of some information on a website. This is what I've done so far.
library(rvest)
url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
page <- html_session(url)
name_nodes <- html_nodes(page,".grpl-name a")
name_text <- html_text(name_nodes)
df <- data.frame(matrix(unlist(name_text)), stringsAsFactors = FALSE)
library(tidyverse)
df <- df %>% mutate(id = row_number())
desc_nodes <- html_nodes(page, ".grpl-purpose")
desc_text <- html_text(desc_nodes)
df <- left_join(df, data.frame(matrix(unlist(desc_text)),
                               stringsAsFactors = FALSE) %>%
                  mutate(id = row_number()))
email_nodes <- html_nodes(page, ".grpl-contact a")
email_text <- html_text(email_nodes)
df <- left_join(df, data.frame(matrix(unlist(email_text)),
                               stringsAsFactors = FALSE) %>%
                  mutate(id = row_number()))
This has been working until I got to the emails part. A few of the entries do not have emails, and instead of the corresponding rows showing NA for the email, the NAs end up in the last three rows.
How do I make the appropriate rows show the NA value instead of just the last three?
The key to solving this problem is to find the 20 parent nodes that are known to exist for each student group. With this list of parent nodes, use the html_node function on each parent. html_node returns exactly one result or NA, depending on whether the desired tag exists. I would recommend this technique any time there is a variable number of sub-nodes.
library(rvest)
library(dplyr)
url <- 'https://uws-community.symplicity.com/index.php?s=student_group'
page <- html_session(url)
#find group names
name_text <- html_nodes(page,".grpl-name a") %>% html_text()
df <- data.frame(name_text, stringsAsFactors = FALSE)
df <- df %>% mutate(id = row_number())
#find text description
desc_text <- html_nodes(page, ".grpl-purpose") %>% html_text()
df$desc_text <- trimws(desc_text)
#find emails
# find the parent nodes with html_nodes
# then find the contact information from each parent using html_node
email_nodes <- html_nodes(page, "div.grpl-grp") %>%
  html_node(".grpl-contact a") %>%
  html_text()
df$emails <- email_nodes
I also took the opportunity to simplify your code: since the vectors are all 20 elements long, there is no need for the unlist/matrix/mutate steps to add the additional columns onto the data frame.
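A minimal self-contained demonstration of the html_nodes vs. html_node difference (a toy example using an inline HTML string rather than the live page):
library(rvest)
library(magrittr)

html <- minimal_html('<div class="grp"><a>A</a></div><div class="grp"></div>')
groups <- html_nodes(html, "div.grp")

html_nodes(groups, "a") %>% html_text()  # length 1: "A" (missing entries are dropped)
html_node(groups, "a") %>% html_text()   # length 2: "A" NA (one slot per parent)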

r rvest: scraping multiple URLs with missing values in some nodes

I would like to scrape employee reviews from kununu. The kununu site has two specific features that need to be taken care of: 1. only the first 10 reviews per company are shown on the first page, and 2. not all items I want to collect are present in every review (missing values in some nodes).
My code works fine for one company, but I cannot get it to work for a list of URLs, e.g. url <- c("https://www.kununu.com/de/novartis-pharma/kommentare", "https://www.kununu.com/de/merckaa/kommentare").
I have tried many different approaches, but nothing has worked so far; maybe one of you knows the trick. Thank you all very much in advance!
My working code (for one company) goes like this:
library(rvest)
library(plyr)  # for round_any()

url <- "https://www.kununu.com/de/novartis-pharma/kommentare"
num_of_reviews <- read_html(url) %>%
  html_nodes(".company-profile-subnav .active .title-number") %>%
  .[[1]] %>%
  html_text()
# round up to the nearest 10
num_of_reviews_rounded <- num_of_reviews %>%
  as.numeric() %>%
  round_any(10, f = ceiling)
pages <- 1:(num_of_reviews_rounded / 10)
get_reviews <- function(url){
  reviews <- url %>%
    read_html() %>%
    html_nodes(".review-content")
  quote <- reviews %>%
    lapply(. %>% html_nodes(".review-title a") %>%
             html_text() %>% ifelse(identical(., character(0)), NA, .)) %>%
    unlist
  date <- reviews %>%
    lapply(. %>% html_nodes(".hidden+ span") %>%
             html_text() %>% ifelse(identical(., character(0)), NA, .)) %>%
    unlist
  rating_image <- reviews %>%
    lapply(. %>% html_nodes(".review-ratings .rating-group:nth-child(13) .rating-badge") %>%
             html_text() %>% ifelse(identical(., character(0)), NA, .)) %>%
    unlist
  statement <- reviews %>%
    lapply(. %>% html_nodes(".col-sm-10 > div:nth-child(1)") %>%
             html_text() %>% ifelse(identical(., character(0)), NA, .)) %>%
    unlist
  a <- data.frame(quote, date, rating_image, statement,
                  stringsAsFactors = FALSE)
  return(a)
}
list_of_dfs <- lapply(pages, function(x) get_reviews(paste0(url, "/", x)))
dfshort <- do.call(rbind, list_of_dfs)
str(dfshort)
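One way to extend this working single-company code to several companies (a sketch, untested against the live site; scrape_company is a hypothetical wrapper reusing the get_reviews function and plyr::round_any from above) is to move the paging logic into a per-company function and lapply over the company URLs:
urls <- c("https://www.kununu.com/de/novartis-pharma/kommentare",
          "https://www.kununu.com/de/merckaa/kommentare")

scrape_company <- function(url) {
  # look up this company's review count and round up to a multiple of 10
  num_of_reviews <- read_html(url) %>%
    html_nodes(".company-profile-subnav .active .title-number") %>%
    .[[1]] %>%
    html_text() %>%
    as.numeric() %>%
    round_any(10, f = ceiling)
  pages <- 1:(num_of_reviews / 10)
  # scrape every page of this company and stack the results
  do.call(rbind, lapply(pages, function(x) get_reviews(paste0(url, "/", x))))
}

df_all <- do.call(rbind, lapply(urls, scrape_company))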
