I am using rvest to scrape Google News.
However, I occasionally encounter missing values in the "Time" element, depending on the keyword. Because some values are missing, the vectors end up with different lengths and building the data frame of scraping results fails with a "different number of rows" error.
Is there any way to fill in NA for these missing values?
Below is an example of the code I am using:
library(rvest)
library(dplyr)

# 'Search' holds the keyword being queried
html_dat <- read_html(paste0("https://news.google.com/search?q=", Search, "&hl=en-US&gl=US&ceid=US%3Aen"))

dat <- data.frame(Link = html_dat %>%
                    html_nodes('.VDXfz') %>%
                    html_attr('href')) %>%
  mutate(Link = gsub("./articles/", "https://news.google.com/articles/", Link))

news_dat <- data.frame(
  Title = html_dat %>%
    html_nodes('.DY5T1d') %>%
    html_text(),
  Link = dat$Link,
  Description = html_dat %>%
    html_nodes('.Rai5ob') %>%
    html_text(),
  Time = html_dat %>%
    html_nodes('.WW6dff') %>%
    html_text()
)
Without knowing the exact page you were looking at, I tried the main Google News page.
In rvest, html_node (without the s) always returns exactly one result per element, even if that result is NA. Therefore, to keep the vectors the same length, you need to find the common parent node of all the desired data nodes, then parse the desired information out of each of those parent nodes.
Assuming the Title node is the most complete, I first went up one level with xml_parent() and tried to retrieve the same number of description nodes; that didn't work. I then tried two levels up with xml_parent() %>% xml_parent(), and that seems to work.
library(rvest)
library(xml2)  # for xml_parent()

url <- "https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en"
html_dat <- read_html(url)

Title <- html_dat %>% html_nodes('.DY5T1d') %>% html_text()
Link <- html_dat %>% html_nodes('.VDXfz') %>% html_attr('href')
Link <- gsub("./articles/", "https://news.google.com/articles/", Link)

# Find the common parent node
# (this was trial and error: tried the parent, then the grandparent)
Titlenodes <- html_dat %>% html_nodes('.DY5T1d') %>% xml_parent() %>% xml_parent()
Description <- Titlenodes %>% html_node('.Rai5ob') %>% html_text()
Time <- Titlenodes %>% html_node('.WW6dff') %>% html_text()

answer <- data.frame(Title, Time, Description, Link)
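Because html_node() returns NA when nothing matches, any story missing a description or time simply shows up as NA in the answer data frame, which is exactly the fill-in behaviour the question asked for.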
Related
I created another post related to this here, but it caused a lot of confusion, so hopefully this post is clearer. FYI, the code I initially wrote for the loop (which didn't do what I needed) is also in the link above, if you want to see it.
Goal
I'm trying to scrape all the competitor data from the IBJJF website: each competitor's division, gender, belt, and weight. Although the stats only appear once at the top of the page, the code below correctly fills in that information for each competitor.
The problem
When this code is used in a for-loop, the loop does not automatically fill in the appropriate information for each competitor. For example, if the belt on a page is black, I would like "black" to appear next to every competitor on that page, but I cannot figure out how to do that.
Question
How would you turn this code into a loop so that the appropriate division, gender, belt, and weight appear next to each competitor's name? Or, if you were trying to capture all the data I mentioned from each page, how would you go about it?
library(rvest)
library(tidyverse)

MensUrl <- read_html('https://www.bjjcompsystem.com/tournaments/1869/categories/2053146')

## SCRAPE FIGHT INFO -------------------------------------------
# fight info
ageDivision <- MensUrl %>%
  html_nodes('.category-title__age-division') %>%
  html_text()

gender <- MensUrl %>%
  html_nodes('.category-title__age-division+ .category-title__label') %>%
  html_text()

belt <- MensUrl %>%
  html_nodes('.category-title__label:nth-child(3)') %>%
  html_text()

weight <- MensUrl %>%
  html_nodes('.category-title__label:nth-child(4)') %>%
  html_text()

fightAndMat <- MensUrl %>%
  html_nodes('.bracket-match-header__where , .bracket-match-header__fight') %>%
  html_text()

date <- MensUrl %>%
  html_nodes('.bracket-match-header__when') %>%
  html_text()

CompetitorNo <- MensUrl %>%
  html_nodes('.match-card__competitor-n') %>%
  html_text()

name <- MensUrl %>%
  html_nodes('.match-card__competitor-description div:nth-child(1)') %>%
  html_text()

gym <- MensUrl %>%
  html_nodes('.match-card__club-name') %>%
  html_text()

#### create match df ####
matches <- data.frame('division' = ageDivision,
                      'gender' = gender,
                      'belt' = belt,
                      'weight' = weight,
                      'fightAndMat' = fightAndMat,
                      'date' = date,
                      'competitor' = CompetitorNo,
                      'name' = name,
                      'gym' = gym)
This is what the above code produces when put in a data frame. I want the final result to look the same.
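One way you could structure the loop (a minimal, untested sketch reusing the selectors from the code above): wrap the page-level scrape in a function that returns one row per competitor. Because html_node() returns a single value for the page-level fields (division, gender, belt, weight), data.frame() recycles them across every competitor row, and map_dfr() then stacks the results for all category pages. The category_urls vector is a placeholder; fill it with the pages you want.

library(rvest)
library(tidyverse)

scrape_category <- function(url) {
  pg <- read_html(url)
  # Page-level fields are length-1, so data.frame() recycles them
  # across the one-row-per-competitor columns below.
  data.frame(
    division   = pg %>% html_node('.category-title__age-division') %>% html_text(),
    gender     = pg %>% html_node('.category-title__age-division+ .category-title__label') %>% html_text(),
    belt       = pg %>% html_node('.category-title__label:nth-child(3)') %>% html_text(),
    weight     = pg %>% html_node('.category-title__label:nth-child(4)') %>% html_text(),
    competitor = pg %>% html_nodes('.match-card__competitor-n') %>% html_text(),
    name       = pg %>% html_nodes('.match-card__competitor-description div:nth-child(1)') %>% html_text(),
    gym        = pg %>% html_nodes('.match-card__club-name') %>% html_text()
  )
}

# Placeholder: substitute the category URLs you want to loop over
category_urls <- c('https://www.bjjcompsystem.com/tournaments/1869/categories/2053146')
allMatches <- map_dfr(category_urls, scrape_category)

Note the per-match fields (fightAndMat, date) are left out here: they occur once per match rather than once per competitor, so they have a different length and would need the common-parent technique from the first answer above rather than simple recycling.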
I am trying to extract some info from this sports betting webpage using rvest. I asked a related question a few days ago and achieved almost 100% of my goals. So far, thanks to you, I have successfully extracted the title, the score, and the time of the matches being played, using the following code:
library(rvest)
library(tidyverse)

page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>%
  read_html()

data <- data.frame(
  Titulo = page %>%
    html_elements(".titulo") %>%
    html_text(),
  Marcador = page %>%
    html_elements(".marcador") %>%
    html_text(),
  Tiempo = page %>%
    html_elements(".marcador+ span") %>%
    html_text() %>%
    str_squish()
)
Now I want to capture the values that repeat across matches. For example, if the country of a match is "Brasil", I want the data frame to say the country is Brasil for every match in that category. So far I have only managed to extract all the countries individually, not matched to each event. The same applies to the sport name and the tournament.
Can you help me with that? Thanks in advance.
You could rewrite your code to use separate functions that work with different levels of information. These can be called in a nested fashion, making the code easier to read.
Essentially, nested map_dfr() calls produce a single data frame from functions working with lists at different levels within the DOM.
Below, you can think of it as an outer loop over sports, an intermediate loop over countries, and an innermost loop over the events within a sport and country.
library(rvest)
library(tidyverse)

get_sport_info <- function(sport) {
  df <- map_dfr(sport %>% html_elements(".category"), get_play_info)
  df$sport <- sport %>%
    html_element(".sport-name") %>%
    html_text()
  return(df)
}

get_play_info <- function(play) {
  df <- map_dfr(play %>% html_elements(".event"), ~
    data.frame(
      titulo = .x %>% html_element(".titulo") %>% html_text(),
      marcador = .x %>% html_element(".marcador") %>% html_text(),
      tiempo = .x %>% html_element(".marcador + span") %>% html_text() %>% str_squish()
    ))
  df$country <- play %>%
    html_element(".category-name") %>%
    html_text()
  return(df)
}

page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>% read_html()
sports <- page %>% html_elements(".sport")
final <- map_dfr(sports, get_sport_info)
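Each row of final then carries its own sport and country values: assigning a single scraped string to df$sport or df$country recycles it down every event row in that group, which produces exactly the repeated values asked for.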
I tried to scrape the per-country data from the interactive pie charts here: https://transparencyreport.google.com/eu-privacy/overview?site_types=start:1453420800000;end:1633219199999;country:&lu=site_types
But SelectorGadget does not let me select the data points on the pie charts. How do I resolve this?
library(rvest)
library(dplyr)
link = "https://transparencyreport.google.com/eu-privacy/overview?site_types=start:1453420800000;end:1633219199999;country:&lu=site_types"
page = read_html(link)
percentage = page %>% html_nodes("#content_types div") %>% html_text()
"#content_types div" returns void.
The charts are drawn in the browser, so the data is not in the static HTML that rvest downloads. If you inspect the page and look at the "Network" tab, you can see the API call being made to fetch the data.
The end number in that call is the last millisecond of today.
There is some junk at the beginning of the response, but the rest is JSON.
You'll have to figure out what the category numbers mean; maybe there is documentation for the API somewhere.
library(magrittr)

link <- "https://transparencyreport.google.com/transparencyreport/api/v3/"

# end parameter: the last millisecond of today (tomorrow's midnight, in ms, minus 1)
parms <- paste0("europeanprivacy/siteinfo/urlsbycontenttype?start=1453420800000&end=",
                1000 * ((Sys.Date() + 1) %>% as.POSIXct() %>% as.numeric()) - 1)

page <- httr::GET(paste0(link, parms))

# drop the junk before the first "[[", then parse the remaining JSON
data <- page %>% httr::content(as = "text") %>%
  substr(., regexpr("\\[\\[.*\\]\\]", .), .Machine$integer.max) %>%
  jsonlite::fromJSON() %>% .[[1]] %>% .[[2]] %>% as.data.frame()
I have the following question.
I am trying to harvest data from the Booking website (for myself only, in order to learn the functionality of the rvest package). Everything seems fine: the package collects what I want and puts everything in a table (data frame).
Here's my code:
library(rvest)
library(lubridate)
library(tidyverse)

page_booking <- c("https://www.booking.com/searchresults.html?aid=397594&label=gog235jc-1FCAEoggI46AdIM1gDaDuIAQGYAQe4ARfIAQzYAQHoAQH4AQyIAgGoAgO4Atap6PoFwAIB0gIkY2RhYmM2NTUtMDRkNS00ODY1LWE3MDYtNzQ1ZmRmNjY3NWY52AIG4AIB&sid=409e05f0cfc7a9e98de21dc3e633dbd6&tmpl=searchresults&ac_click_type=b&ac_position=0&checkin_month=9&checkin_monthday=10&checkin_year=2020&checkout_month=9&checkout_monthday=17&checkout_year=2020&class_interval=1&dest_id=197&dest_type=country&from_sf=1&group_adults=2&group_children=0&label_click=undef&no_rooms=1&offset=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=eb0e56a23d6c0004&ss=Spanien&ss_raw=spanien&ssb=empty&top_ufis=1&selected_currency=USD&changed_currency=1&top_currency=1&nflt=") %>%
  paste0(1:60) %>%
  paste0(c("?ie=UTF8&pageNumber=")) %>%
  paste0(1:60) %>%
  paste0(c("&pageSize=10&sortBy=recent"))
In this chunk I collect the data from the first 60 pages, after first manually feeding the Booking search engine the country of my choice (Spain), the dates I am interested in (an arbitrary interval), and the number of people (I used the defaults).
Then I add this code to select the properties I want:
read_hotel <- function(url) {   # collect hotel names
  ho <- read_html(url)
  headline <- ho %>%
    html_nodes("span.sr-hotel__name") %>%   # the node I want to read
    html_text() %>%
    as_tibble()
}

hotels <- map_dfr(page_booking, read_hotel)

read_pr <- function(url) {   # collect price tags
  pr <- read_html(url)
  full_pr <- pr %>%
    html_nodes("div.bui-price-display__value") %>%   # the node I want to read
    html_text() %>%
    as_tibble()
}

fullprice <- map_dfr(page_booking, read_pr)
... and eventually save all the data in a data frame:

dfr <- tibble(hotels = hotels,
              price_fact = fullprice)
I collect more parameters, but that doesn't matter here. The final data frame of 1500 rows and two columns is created. The problem is that the data in the second column does not correspond to the data in the first, which is really strange and renders my data frame useless.
I don't really understand how the package works in the background or why it behaves this way. I also noticed that the first rows of the first column (hotel names) do not correspond to the first hotels I see on the website, so rvest seems to use different search/sort/filter criteria.
Could you please explain what happens while rvest hops between nodes?
I would really appreciate at least some explanation, just to better understand the tool we're working with.
You shouldn't scrape the hotels' names and prices separately like that. Instead, first get all the item (hotel) nodes, then scrape the name and price relative to each hotel node. With this method, the order can't get mixed up.
library(rvest)
library(purrr)

page_booking <- c("https://www.booking.com/searchresults.html?aid=397594&label=gog235jc-1FCAEoggI46AdIM1gDaDuIAQGYAQe4ARfIAQzYAQHoAQH4AQyIAgGoAgO4Atap6PoFwAIB0gIkY2RhYmM2NTUtMDRkNS00ODY1LWE3MDYtNzQ1ZmRmNjY3NWY52AIG4AIB&sid=409e05f0cfc7a9e98de21dc3e633dbd6&tmpl=searchresults&ac_click_type=b&ac_position=0&checkin_month=9&checkin_monthday=10&checkin_year=2020&checkout_month=9&checkout_monthday=17&checkout_year=2020&class_interval=1&dest_id=197&dest_type=country&from_sf=1&group_adults=2&group_children=0&label_click=undef&no_rooms=1&offset=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&shw_aparth=1&slp_r_match=0&src=index&src_elem=sb&srpvid=eb0e56a23d6c0004&ss=Spanien&ss_raw=spanien&ssb=empty&top_ufis=1&selected_currency=USD&changed_currency=1&top_currency=1&nflt=") %>%
  paste0(1:60) %>%
  paste0(c("?ie=UTF8&pageNumber=")) %>%
  paste0(1:60) %>%
  paste0(c("&pageSize=10&sortBy=recent"))

hotels <-
  map_dfr(
    page_booking,
    function(url) {
      pg <- read_html(url)
      items <- pg %>%
        html_nodes(".sr_item")
      map_dfr(
        items,
        function(item) {
          data.frame(
            hotel = item %>% html_node(xpath = "./descendant::*[contains(@class,'sr-hotel__name')]") %>% html_text(trim = T),
            price = item %>% html_node(xpath = "./descendant::*[contains(@class,'bui-price-display__value')]") %>% html_text(trim = T)
          )
        }
      )
    }
  )
(The dot at the start of each XPath expression represents the current node, i.e. the hotel item.)
Update:
Here is a version of the code that I think is faster but still does the job:
hotels <-
  map_dfr(
    page_booking,
    function(url) {
      pg <- read_html(url)
      items <- pg %>%
        html_nodes(".sr_item")
      data.frame(
        hotel = items %>% html_node(xpath = "./descendant::*[contains(@class,'sr-hotel__name')]") %>% html_text(trim = T),
        price = items %>% html_node(xpath = "./descendant::*[contains(@class,'bui-price-display__value')]") %>% html_text(trim = T)
      )
    }
  )
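The update is faster because html_node() (singular) is vectorised over a nodeset: applied to items it returns exactly one match (or NA) per hotel node, so two calls per page replace a pair of calls per hotel while still preserving the name-price pairing.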
I always seem to have a problem scraping reference sites using either Python or R. Whenever I use my usual XPath approach (in Python) or rvest approach (in R), the table I want never gets picked up by the scraper.
library(rvest)

url = 'https://www.pro-football-reference.com/years/2016/games.htm'
webpage = read_html(url)
table_links = webpage %>% html_node("table") %>% html_nodes("a")
boxscore_links = subset(table_links, table_links %>% html_text() %in% "boxscore")
boxscore_links = as.list(boxscore_links)

for(x in boxscore_links){
  keep = substr(x, 10, 36)
  url2 = paste('https://www.pro-football-reference.com', keep, sep = "")
  webpage2 = read_html(url2)

  home_team = webpage2 %>% html_nodes(xpath = '//*[@id="all_home_starters"]') %>% html_text()
  away_team = webpage2 %>% html_nodes(xpath = '//*[@id="all_vis_starters"]') %>% html_text()
  home_starters = webpage2 %>% html_nodes(xpath = '//*[(@id="div_home_starters")]') %>% html_text()
  home_starters2 = webpage2 %>% html_nodes(xpath = '//*[(@id="div_home_starters")]') %>% html_table()

  # code that will bind lineup tables with some master table -- to be written later
}
I'm trying to scrape the starting-lineup tables. The first bit of code pulls the URLs for all 2016 boxscores, and the for-loop visits each boxscore page hoping to extract the tables headed "(Team Name) Starters".
Here's one link, for example: 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'
When I run the code above, the home_starters and home_starters2 objects contain zero elements (ideally they should contain the table, or the elements of the table, I'm trying to bring in).
I appreciate the help!
I've spent the last three hours trying to figure this out. The tables on this site are embedded inside HTML comments, so a plain read_html() never exposes them; you have to pull out the comments and re-parse their text as HTML. This is how it should be done. It uses my example, but I'm sure you can apply it to yours.
"https://www.pro-football-reference.com/years/2017/" %>% read_html() %>% html_nodes(xpath = '//comment()') %>% # select comments
html_text() %>% # extract comment text
paste(collapse = '') %>% # collapse to single string
read_html() %>% # reread as HTML
html_node('table#returns') %>% # select desired node
html_table()
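Applied to the boxscore pages from the question, the same comment trick would look roughly like the sketch below. Note the table id home_starters is an assumption inferred from the div_home_starters wrapper in the question's XPath; verify it against the actual page source.

library(rvest)

url2 <- 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'

home_starters_tbl <- url2 %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%   # the starters tables live inside comments
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node('table#home_starters') %>%    # assumed id -- check the page source
  html_table()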