How to scrape Google News results into a data.frame with rvest

Through other SO questions I've found out how to get the headlines, but I don't know where Google's markup stores the links.
I want a two-column data.frame of the headlines and their corresponding links.
library(rvest)
library(tidyverse)

dat <- read_html("https://news.google.com/search?q=coronavirus&hl=en-US&gl=US&ceid=US%3Aen") %>%
  html_nodes('.DY5T1d') %>%
  html_text()
dat

After a lot of inspecting Google's page source I found what I was looking for. I also came across the descriptions, so I essentially rebuilt the Google News RSS feed.
library(rvest)
library(tidyverse)

news <- function(term) {
  html_dat <- read_html(paste0("https://news.google.com/search?q=", term, "&hl=en-US&gl=US&ceid=US%3Aen"))
  dat <- data.frame(Link = html_dat %>%
                      html_nodes('.VDXfz') %>%
                      html_attr('href')) %>%
    mutate(Link = gsub("./articles/", "https://news.google.com/articles/", Link))
  news_dat <- data.frame(
    Title = html_dat %>%
      html_nodes('.DY5T1d') %>%
      html_text(),
    Link = dat$Link
  )
  return(news_dat)
}

news("coronavirus")
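The function above returns only titles and links, but the descriptions (and timestamps) mentioned can be pulled the same way. The following is only a sketch: it reuses the '.Rai5ob' and '.WW6dff' classes that appear in a later question on this page, and Google may have changed any of these class names since. Note also that if some articles lack a description or time, the columns will end up with differing lengths; a later question on this page deals with exactly that problem.

library(rvest)
library(tidyverse)

# Sketch only: '.Rai5ob' (description) and '.WW6dff' (time) are class names
# borrowed from another question on this page and may no longer match
# Google's markup; check the page source before relying on them.
news_full <- function(term) {
  html_dat <- read_html(paste0("https://news.google.com/search?q=", term, "&hl=en-US&gl=US&ceid=US%3Aen"))
  data.frame(
    Title       = html_dat %>% html_nodes('.DY5T1d') %>% html_text(),
    Link        = html_dat %>% html_nodes('.VDXfz') %>% html_attr('href') %>%
      gsub("./articles/", "https://news.google.com/articles/", .),
    Description = html_dat %>% html_nodes('.Rai5ob') %>% html_text(),
    Time        = html_dat %>% html_nodes('.WW6dff') %>% html_text()
  )
}

news_full("coronavirus")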

Related

Scraping Google with rvest (2022 layout update)

A few weeks ago people on this site helped me with code to get links, titles, and text from a Google search using rvest. Now I am trying to use the same code as provided in:
How to retrieve hyperlinks in google search using rvest
How to retrieve text below titles from google search using rvest
and it is not working, giving the following results:
library(rvest)
library(tidyverse)

# Part 1
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
title <- "//div/div/div/a/h3"
text <- paste0(title, "/parent::a/parent::div/following-sibling::div")
first_page <- read_html(url)
tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
       text  = first_page %>% html_nodes(xpath = text) %>% html_text())
Result:
# A tibble: 0 x 2
# ... with 2 variables: title <chr>, text <chr>
And second part:
# Part 2
titles <- html_nodes(first_page, xpath = "//div/div/a/h3")
titles %>%
  html_elements(xpath = "./parent::a") %>%
  html_attr("href") %>%
  str_extract("https.*?(?=&)")
Result:
character(0)
But this worked just a few weeks ago. Is it possible to fix this issue?
It looks like Google decided to change its HTML layout; perhaps there were too many of us scrapers.
Here you go:
library(rvest)
library(tidyverse)

# Part 1
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
title <- "//div/div/div/a/div/div/h3/div"
text <- paste0(title, "/parent::h3/parent::div/parent::div/parent::a/parent::div/following-sibling::div/div[1]/div[1]/div[1]/div[1]/div[1]")
first_page <- read_html(url)
tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
       text  = first_page %>% html_nodes(xpath = text) %>% html_text())
And part 2:
titles <- html_nodes(first_page, xpath = "//div/div/div/a/div/div/h3/div")
titles %>%
  html_elements(xpath = "./parent::h3/parent::div/parent::div/parent::a") %>%
  html_attr("href") %>%
  str_extract("https.*?(?=&)")
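If it is useful, the two parts can be combined into a single helper that returns one tibble of title, snippet, and link. This is just a sketch assembled from the XPaths above; it will break again the next time Google changes its layout, and it assumes every result has a matching snippet (otherwise the column lengths will differ).

library(rvest)
library(tidyverse)

# Sketch: wraps the title, snippet and href extraction above into one call.
# The XPaths are the ones from the answer and are not guaranteed to stay valid.
scrape_google <- function(query) {
  url        <- paste0("https://www.google.com/search?q=", gsub(" ", "+", query))
  title_xp   <- "//div/div/div/a/div/div/h3/div"
  text_xp    <- paste0(title_xp, "/parent::h3/parent::div/parent::div/parent::a/parent::div/following-sibling::div/div[1]/div[1]/div[1]/div[1]/div[1]")
  first_page <- read_html(url)
  titles     <- html_nodes(first_page, xpath = title_xp)
  tibble(
    title = titles %>% html_text(),
    text  = first_page %>% html_nodes(xpath = text_xp) %>% html_text(),
    link  = titles %>%
      html_elements(xpath = "./parent::h3/parent::div/parent::div/parent::a") %>%
      html_attr("href") %>%
      str_extract("https.*?(?=&)")
  )
}

scrape_google("Mario Torres Mexico")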

Rvest scraping google news with different number of rows

I am using rvest to scrape Google News.
However, depending on the keyword, I sometimes encounter missing values in the "Time" element. Because those values are missing, building the data frame of scraping results fails with a "differing number of rows" error.
Is there any way to fill in NA for these missing values?
Below is an example of the code I am using.
html_dat <- read_html(paste0("https://news.google.com/search?q=", Search, "&hl=en-US&gl=US&ceid=US%3Aen"))

dat <- data.frame(Link = html_dat %>%
                    html_nodes('.VDXfz') %>%
                    html_attr('href')) %>%
  mutate(Link = gsub("./articles/", "https://news.google.com/articles/", Link))

news_dat <- data.frame(
  Title = html_dat %>%
    html_nodes('.DY5T1d') %>%
    html_text(),
  Link = dat$Link,
  Description = html_dat %>%
    html_nodes('.Rai5ob') %>%
    html_text(),
  Time = html_dat %>%
    html_nodes('.WW6dff') %>%
    html_text()
)
Without knowing the exact page you were looking at, I tried the first Google News page.
In rvest, html_node (without the s) always returns a value, even if that value is NA. So, to keep the vectors the same length, you need to find the common parent node of all the desired data nodes and then parse the desired information out of each of those parent nodes.
Assuming the Title nodes are the most complete, I went up one level with xml_parent() and tried to retrieve the same number of description nodes; that didn't work. Going up two levels with xml_parent() %>% xml_parent() does seem to work.
library(rvest)
library(xml2)   # for xml_parent()

url <- "https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en"
html_dat <- read_html(url)

Title <- html_dat %>% html_nodes('.DY5T1d') %>% html_text()
Link  <- html_dat %>% html_nodes('.VDXfz') %>% html_attr('href')
Link  <- gsub("./articles/", "https://news.google.com/articles/", Link)

# Find the common parent node
# (this was trial and error: first the parent, then the grandparent)
Titlenodes  <- html_dat %>% html_nodes('.DY5T1d') %>% xml_parent() %>% xml_parent()
Description <- Titlenodes %>% html_node('.Rai5ob') %>% html_text()
Time        <- Titlenodes %>% html_node('.WW6dff') %>% html_text()

answer <- data.frame(Title, Time, Description, Link)
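An equivalent way to express the same idea, shown here only as a sketch: write a small function that parses a single parent node, apply it to every parent with lapply(), and bind the one-row data frames together (the Amazon answer at the end of this page uses the same pattern). Because html_node() (singular) returns NA for an absent child, missing descriptions or times come back as NA automatically.

library(rvest)
library(xml2)
library(dplyr)

# Sketch: one row per article card; html_node() (singular) returns NA when a
# child node is absent, so all columns stay the same length.
parse_card <- function(node) {
  data.frame(
    Title       = node %>% html_node('.DY5T1d') %>% html_text(),
    Description = node %>% html_node('.Rai5ob') %>% html_text(),
    Time        = node %>% html_node('.WW6dff') %>% html_text(),
    stringsAsFactors = FALSE
  )
}

html_dat <- read_html("https://news.google.com/topstories?hl=en-US&gl=US&ceid=US:en")
cards    <- html_dat %>% html_nodes('.DY5T1d') %>% xml_parent() %>% xml_parent()
answer   <- lapply(cards, parse_card) %>% bind_rows()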

R scraping reviews from multiple pages on TripAdvisor

I'm trying to pull a few pages of reviews from TripAdvisor for an academic project.
Here's my attempt in R:
# Load libraries
library(rvest)
library(RSelenium)

# Main url for stadium
urlmainlist <- c(
  hampdenpark = "http://www.tripadvisor.com.ph/Attraction_Review-g186534-d214132-Reviews-Hampden_Park-Glasgow_Scotland.html"
)

# Specify how many search pages and counter
morepglist <- list(
  hampdenpark = seq(10, 360, 10)
)

# ---------------------------------------------------------------------------
# Create pickstadium variable
pickstadium <- "hampdenpark"

# Get the list of url links corresponding to different pages
# url link for first search page
urllinkmain <- urlmainlist[pickstadium]
# Counter for additional pages
morepg <- as.numeric(morepglist[[pickstadium]])

urllinkpre  <- paste(strsplit(urllinkmain, "Reviews-")[[1]][1], "Reviews", sep = "")
urllinkpost <- strsplit(urllinkmain, "Reviews-")[[1]][2]

urllink    <- rep(NA, length(morepg) + 1)
urllink[1] <- urllinkmain
for (i in 1:length(morepg)) {
  urllink[i + 1] <- paste(urllinkpre, "-or", morepg[i], "-", urllinkpost, sep = "")
}
head(urllink)
write.csv(urllink, 'urllink.csv')

##########
# SCRAPING
##########
library(rvest)
library(RSelenium)
# install.packages('RSelenium')

testurl <- read.csv("urllink.csv", header = FALSE, quote = "'", stringsAsFactors = FALSE)
testurl <- testurl[-1, ]
testurl <- testurl[, -1]
testurl <- as.data.frame(testurl)
testurl <- gsub('"', "", testurl$testurl)
list    <- unlist(testurl)

tripadvisor <- NULL

# Scrape
for (i in 1:length(list)) {
  reviews <- list[i] %>%
    read_html() %>%
    html_nodes("#REVIEWS .innerBubble")

  id <- reviews %>%
    html_node(".quote a") %>%
    html_attr("id")

  rating <- reviews %>%
    html_node(".rating .rating_s_fill") %>%
    html_attr("alt") %>%
    gsub(" of 5 stars", "", .) %>%
    as.integer()

  date <- reviews %>%
    html_node(".rating .ratingDate") %>%
    html_attr("title") %>%
    strptime("%b %d, %Y") %>%
    as.POSIXct()

  review <- reviews %>%
    html_node(".entry .partial_entry") %>%
    html_text() %>%
    as.character()

  rowthing    <- data.frame(id, review, rating, date, stringsAsFactors = FALSE)
  tripadvisor <- rbind(rowthing, tripadvisor)
}
However, this results in an empty tripadvisor data frame. Any help fixing this would be appreciated.
Additional Question
I'd also like to capture the full reviews, as my code currently captures only the partial entries. For each review, I'd like to automatically click the 'More' link and then extract the full review.
Here too, any help would be greatly appreciated.
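One possible direction for the 'More' links, offered only as an untested sketch: rvest alone cannot click them because the full text is expanded by JavaScript, whereas RSelenium (already loaded above) can drive a real browser to do the clicking and then hand the rendered page back to rvest. The '.partial_entry .taLnk' selector below is an assumption about TripAdvisor's markup and would need to be checked against the live page.

library(rvest)
library(RSelenium)

# Untested sketch: the CSS selector for the 'More' link
# ('.partial_entry .taLnk') is assumed and may need adjusting.
rD    <- rsDriver(browser = "firefox")
remDr <- rD$client

# urllink was built in the question code above
remDr$navigate(urllink[1])

# Click every 'More' link so the full review text gets rendered
more_links <- remDr$findElements(using = "css selector", ".partial_entry .taLnk")
for (el in more_links) el$clickElement()
Sys.sleep(2)  # give the page time to expand the reviews

# Hand the rendered page back to rvest and pull the now-complete entries
full_reviews <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes("#REVIEWS .entry") %>%
  html_text()

remDr$close()
rD$server$stop()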

Looping through a list of webpages with rvest follow_link

I'm trying to scrape the government release calendar (https://www.gov.uk/government/statistics) and use rvest's follow_link functionality to go to each publication link and scrape text from the next page. I have this working for a single page of results (40 publications are displayed per page), but I can't get a loop to work so that I can run the code over all the publications listed.
This is the code I run first to get the list of publications (just from the first 10 pages of results):
# Loading the rvest package
library('rvest')
library('dplyr')
library('tm')

####### PUBLISHED RELEASES ################
### Add a number after 'page=' in the url to loop over all pages of published releases results (only 40 publications per page)
### Check the site and see how many pages you want to scrape, to cover the months of interest

## Titles of publications - creates a list
publishedtitles <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
  function(url_base){
    url_base %>% read_html() %>%
      html_nodes('h3 a') %>%
      html_text()
  })

## Dates of publications
publisheddates <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
  function(url_base){
    url_base %>% read_html() %>%
      html_nodes('.public_timestamp') %>%
      html_text()
  })

## Organisations
publishedorgs <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
  function(url_base){
    url_base %>% read_html() %>%
      html_nodes('.organisations') %>%
      html_text()
  })

## Links to publications
publishedpartial_links <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
  function(url_base){
    url_base %>% read_html() %>%
      html_nodes('h3 a') %>%
      html_attr('href')
  })

# Check all lists are the same length - if not, deal with missings before the next step
# length(publishedtitles)
# length(publisheddates)
# length(publishedorgs)
# length(publishedpartial_links)
# str(publishedorgs)

# Combining all the lists to form a data frame
published <- data.frame(Title = unlist(publishedtitles),
                        Date = unlist(publisheddates),
                        Organisation = unlist(publishedorgs),
                        PartLinks = unlist(publishedpartial_links))

# Adding prefix to partial links, to turn them into full URLs
published$Links <- paste("https://www.gov.uk", published$PartLinks, sep = "")

# Drop partial links column
keeps <- c("Title", "Date", "Organisation", "Links")
published <- published[keeps]
Then I want to run something like the code below, but over all pages of results. I've run this code manually, changing the parameters for each page, so I know it works.
session1 <- html_session("https://www.gov.uk/government/statistics?page=1")

list1 <- list()
for (i in published$Title[1:40]) {
  nextpage1 <- session1 %>% follow_link(i) %>% read_html()
  list1[[i]] <- nextpage1 %>%
    html_nodes(".grid-row") %>% html_text()
  df1 <- data.frame(text = list1)
  df1 <- as.data.frame(t(df1))
}
So the above would need the page=1 in html_session to change, and also published$Title[1:40] - I'm struggling to create a function or loop that includes both variables.
I think I should be able to do this using lapply:
df <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
  function(url_base){
    for (i in published$Title[1:40]) {
      nextpage1 <- url_base %>% follow_link(i) %>% read_html()
      list1[[i]] <- nextpage1 %>%
        html_nodes(".grid-row") %>% html_text()
    }
  })
But I get the error:
Error in follow_link(., i) : is.session(x) is not TRUE
I've also tried other methods of looping and turning it into a function but didn't want to make this post too long!
Thanks in advance for any suggestions and guidance :)
It looks like you may just need to start a session inside the lapply function. In the last chunk of code, url_base is simply a text string that gives the base URL. Would something like this work:
df <- lapply(paste0('https://www.gov.uk/government/statistics?page=', 1:10),
  function(url_base){
    for (i in published$Title[1:40]) {
      tmpSession <- html_session(url_base)
      nextpage1 <- tmpSession %>% follow_link(i) %>% read_html()
      list1[[i]] <- nextpage1 %>%
        html_nodes(".grid-row") %>% html_text()
    }
  })
To change published$Title[1:40] for each iteration of the lapply function, you could make objects that hold the lower and upper bounds of the indices:
lowers <- cumsum(c(1, rep(40, 9)))
uppers <- cumsum(rep(40, 10))
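For reference, with exactly 40 results per page those two vectors work out to the row ranges of published that correspond to each page:

lowers
# [1]   1  41  81 121 161 201 241 281 321 361
uppers
# [1]  40  80 120 160 200 240 280 320 360 400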
Then, you could include those in the call to lapply
df <- lapply(1:10, function(j){
  url_base <- paste0('https://www.gov.uk/government/statistics?page=', j)
  for (i in published$Title[lowers[j]:uppers[j]]) {
    tmpSession <- html_session(url_base)
    nextpage1 <- tmpSession %>% follow_link(i) %>% read_html()
    list1[[i]] <- nextpage1 %>%
      html_nodes(".grid-row") %>% html_text()
  }
})
Not sure if this is what you want or not; I might have misunderstood the things that are supposed to be changing.
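One caveat with the sketch above, worth noting: the body of the anonymous function is a for loop, which returns NULL, and list1[[i]] <- ... only modifies a local copy of list1 inside the function, so df comes back as a list of NULLs. A minimal adjustment, collecting the results inside the function and returning them, keeps everything in df:

# Sketch: same page/title pairing as above, but each iteration builds and
# returns its own list of scraped text so that lapply() captures it in df.
df <- lapply(1:10, function(j){
  url_base   <- paste0('https://www.gov.uk/government/statistics?page=', j)
  tmpSession <- html_session(url_base)
  page_texts <- list()
  for (i in published$Title[lowers[j]:uppers[j]]) {
    nextpage1 <- tmpSession %>% follow_link(i) %>% read_html()
    page_texts[[i]] <- nextpage1 %>%
      html_nodes(".grid-row") %>% html_text()
  }
  page_texts  # returned: a named list of texts for page j
})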

Scraping Amazon customer reviews with missing data

I'd like to scrape Amazon customer reviews, and while my code works fine when no information is missing, converting the scraped data to a data frame no longer works when parts of the data are missing ("arguments imply differing number of rows").
This is an example code:
library(rvest)

url <- read_html("https://www.amazon.de/product-reviews/3980710688/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=42&sortBy=recent")

get_reviews <- function(url) {
  title <- url %>%
    html_nodes("#cm_cr-review_list .a-color-base") %>%
    html_text()
  author <- url %>%
    html_nodes(".author") %>%
    html_text()
  df <- data.frame(title, author, stringsAsFactors = FALSE)
  return(df)
}

results <- get_reviews(url)
In this case, "missing" means that no author information is provided for several of the customer reviews ("Ein Kunde" simply means "a customer" in German).
Does anyone have an idea how to fix this? Any help is appreciated. Thanks in advance!
I would say the answer to your question is here (link):
iterate over each
'div[id*=customer_review]'
node and then check whether the author value is present for it or not.
Adapting an approach from the link Nardack provided, I could scrape the data with the following code:
library(dplyr)
library(rvest)

get_reviews <- function(node){
  r.title <- html_nodes(node, ".a-color-base") %>%
    html_text()
  r.author <- html_nodes(node, ".author") %>%
    html_text()
  df <- data.frame(
    title  = ifelse(length(r.title) == 0, NA, r.title),
    author = ifelse(length(r.author) == 0, NA, r.author),
    stringsAsFactors = FALSE)
  return(df)
}

url <- read_html("https://www.amazon.de/product-reviews/3980710688/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=42&sortBy=recent") %>%
  html_nodes("div[id*=customer_review]")

out <- lapply(url, get_reviews) %>% bind_rows()
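The same per-review function extends naturally to more fields. A sketch, assuming '.review-rating' and '.review-date' as illustrative selectors (they are not taken from the answer above and would need to be verified against Amazon's current markup); using html_node() (singular) inside each review node returns NA whenever a field is absent, which avoids the differing-row-count problem in another way.

library(dplyr)
library(rvest)

# Sketch only: '.review-rating' and '.review-date' are assumed selectors;
# verify them against the live page before use.
get_review_fields <- function(node){
  data.frame(
    title  = node %>% html_node(".a-color-base")  %>% html_text(),
    author = node %>% html_node(".author")        %>% html_text(),
    rating = node %>% html_node(".review-rating") %>% html_text(),
    date   = node %>% html_node(".review-date")   %>% html_text(),
    stringsAsFactors = FALSE)
}

review_nodes <- read_html("https://www.amazon.de/product-reviews/3980710688/ref=cm_cr_dp_d_show_all_btm?ie=UTF8&reviewerType=all_reviews&pageNumber=42&sortBy=recent") %>%
  html_nodes("div[id*=customer_review]")

out <- lapply(review_nodes, get_review_fields) %>% bind_rows()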
