Scraping Google with rvest (2022 layout update) - r

A few weeks ago people on this site helped me with code to get the links, titles, and text from a Google search using rvest. Now I am trying to use the same code as provided in:
How to retrieve hyperlinks in google search using rvest
How to retrieve text below titles from google search using rvest
and it is no longer working, giving the following results:
library(rvest)
library(tidyverse)
#Part 1
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
title <- "//div/div/div/a/h3"
text <- paste0(title, "/parent::a/parent::div/following-sibling::div")
first_page <- read_html(url)
tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
       text = first_page %>% html_nodes(xpath = text) %>% html_text())
Result:
# A tibble: 0 x 2
# ... with 2 variables: title <chr>, text <chr>
And the second part:
#Part 2
titles <- html_nodes(first_page, xpath = "//div/div/a/h3")
titles %>%
  html_elements(xpath = "./parent::a") %>%
  html_attr("href") %>%
  str_extract("https.*?(?=&)")
Result:
character(0)
But just a few weeks ago this worked. Is it possible to fix this issue?

It looks like Google decided to change their HTML layout; perhaps there were too many of us scrapers.
Here you go:
library(rvest)
library(tidyverse)
#Part 1
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
title <- "//div/div/div/a/div/div/h3/div"
text <- paste0(title, "/parent::h3/parent::div/parent::div/parent::a/parent::div/following-sibling::div/div[1]/div[1]/div[1]/div[1]/div[1]")
first_page <- read_html(url)
tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
       text = first_page %>% html_nodes(xpath = text) %>% html_text())
And part 2:
titles <- html_nodes(first_page, xpath = "//div/div/div/a/div/div/h3/div")
titles %>%
  html_elements(xpath = "./parent::h3/parent::div/parent::div/parent::a") %>%
  html_attr("href") %>%
  str_extract("https.*?(?=&)")

Related

Different number of rows for headlines and urls when web-scraping Google News using rvest

I am testing out different keywords on Google News to web-scrape headlines and URLs, but for some keywords the number of headlines and the number of URLs do not match.
library(rvest)
library(stringr)
library(magrittr)
link = "https://news.google.com/search?q=onn%20hafiz&hl=en-MY&gl=MY&ceid=MY%3Aen"
headline = read_html(link) %>% html_nodes('.DY5T1d') %>% html_text()
url = read_html(link) %>% html_nodes(".VDXfz") %>% html_attr("href") %>% str_sub(2) %>% paste0("https://news.google.com", .)
data.frame(headline, url)
Results:
Error in data.frame(headline, url) :
arguments imply differing number of rows: 82, 85
But with other keywords, this seems to work fine.
link = "https://news.google.com/search?q=international%20petroleum&hl=en-MY&gl=MY&ceid=MY%3Aen"
headline = read_html(link) %>% html_nodes('.DY5T1d') %>% html_text()
url = read_html(link) %>% html_nodes(".VDXfz") %>% html_attr("href") %>% str_sub(2) %>% paste0("https://news.google.com", .)
data.frame(headline, url)
Does anyone know the cause of this, and how to fix it? Thanks
With those selectors you are extracting headlines from different nodes than hrefs, and there doesn't seem to be a fixed 1:1 relation between the two. At the time of writing, your first search returns some nested headlines, and that is probably why your headline and URL counts do not match.
Get the URL and text from the same node and you should be covered:
url <- "https://news.google.com/search?q=onn%20hafiz&hl=en-MY&gl=MY&ceid=MY%3Aen"
headline_links <- read_html(url) %>% html_nodes('a.DY5T1d')
data.frame(
headline = headline_links %>% html_text(),
url = headline_links %>% html_attr("href") %>% str_sub(2) %>% paste0("https://news.google.com", .)
)
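If you want to run this for several keywords, one option is to wrap it in a small helper function, e.g. the sketch below. The a.DY5T1d class and the hl/gl/ceid URL parameters are taken from the question and answer above and may well change on Google's side:
library(rvest)
library(stringr)

# Build a Google News search URL for a query and return headlines with their links
get_gnews <- function(query) {
  url <- paste0("https://news.google.com/search?q=", URLencode(query),
                "&hl=en-MY&gl=MY&ceid=MY%3Aen")
  headline_links <- read_html(url) %>% html_nodes("a.DY5T1d")
  data.frame(
    headline = headline_links %>% html_text(),
    url = headline_links %>% html_attr("href") %>% str_sub(2) %>%
      paste0("https://news.google.com", .)
  )
}

get_gnews("onn hafiz")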

Using rvest to webscrape multiple pages

I am trying to extract all speeches given by Melania Trump from 2016-2020 at the following link: https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush. I am using rvest to do so. Here is my code thus far:
# get main link
link <- "https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/remarks-and-statements-the-first-lady-laura-bush"
# main page
page <- read_html(link)
# extract speech titles
title <- page %>% html_nodes("td.views-field-title") %>% html_text()
title_links = page %>% html_nodes("td.views-field-title") %>%
  html_attr("href") %>% paste("https://www.presidency.ucsb.edu/", ., sep="")
title_links
# extract year of speech
year <- page %>% html_nodes(".date-display-single") %>% html_text()
# extract name of person giving speech
flotus <- page %>% html_nodes(".views-field-title-1.nowrap") %>% html_text()
get_text <- function(title_link){
  speech_page = read_html(title_links)
  speech_text = speech_page %>% html_nodes(".field-docs-content p") %>%
    html_text() %>% paste(collapse = ",")
  return(speech_page)
}
text = sapply(title_links, FUN = get_text)
I am having trouble with the following line of code:
title <- page %>% html_nodes("td.views-field-title") %>% html_text()
title_links = page %>% html_nodes("td.views-field-title") %>%
  html_attr("href") %>% paste("https://www.presidency.ucsb.edu/", ., sep="")
title_links
In particular, title_links yields a series of links like this: "https://www.presidency.ucsb.eduNA", rather than the individual web pages. Does anyone know what I am doing wrong here? Any help would be appreciated.
You are querying the wrong CSS node.
Try:
page %>% html_elements(css = "td.views-field-title a") %>% html_attr('href')
[1] "https://www.presidency.ucsb.edu/documents/remarks-mrs-laura-bush-the-national-press-club"
[2] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-un-commission-the-status-women-international-womens-day"
[3] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-colorado-early-childhood-cognitive-development-summit"
[4] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-10th-anniversary-the-holocaust-memorial-museum-and-opening-anne"
[5] "https://www.presidency.ucsb.edu/documents/remarks-the-first-lady-the-preserve-america-initiative-portland-maine"

Any tip for start scraping an e-commerce site with RVEST?

I am trying to scrape some data from an e-commerce site using rvest. I haven't found any good examples to guide me. Any ideas?
As an example, here is how I started:
library(rvest)
library(purrr)
#Specifying the url
url_base <- 'https://telefonia.mercadolibre.com.uy/accesorios-celulares/'
#Reading the HTML code from the website
webpage <- read_html(url_base)
#Using CSS selectors to scrape the titles section
title_html <- html_nodes(webpage,'.main-title')
#Converting the title data to text
title <- html_text(title_html)
head(title)
#Using CSS selectors to scrape the price section
price <- html_nodes(webpage,'.item__price')
price <- html_text(price)
price
So, I would like to do two basic things:
Enter each product page and take some data from it.
Paginate through all the pages.
Any help? Thank you.
Scraping that info is not difficult and is doable with rvest.
What you need to do is get all the hrefs and loop over them. To do that, you need html_attr().
The following code should do the job:
library(tidyverse)
library(rvest)
#Specifying the url
url_base <- 'https://telefonia.mercadolibre.com.uy/accesorios-celulares/'
#You need to get href and loop on hrefs
all_pages <- url_base %>% read_html() %>% html_nodes(".pagination__page > a") %>% html_attr("href")
all_pages[1] <- url_base
#create an empty table to store results
result_table <- tibble()
for(page in all_pages){
  page_source <- read_html(page)
  title <- html_nodes(page_source,'.item__info-title') %>% html_text()
  price <- html_nodes(page_source,'.item__price') %>% html_text()
  item_link <- html_nodes(page_source,'.item__info-title') %>% html_attr("href")
  temp_table <- tibble(title = title, price = price, item_link = item_link)
  result_table <- bind_rows(result_table,temp_table)
}
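When looping over many pages like this, it can also help to pause briefly between requests so you don't hammer the server. A small addition (not in the original answer) would be:
for(page in all_pages){
  Sys.sleep(1)  # be polite: wait a second between requests
  page_source <- read_html(page)
  # ... extract title, price and item_link as above ...
}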
After you get the link to each item, you can loop over the item links.
To view more pages
As you can see, there is a pattern in the URL suffix; you can simply increase the number by 50 each time to navigate to more pages.
> all_pages
[1] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/"
[2] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_51"
[3] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_101"
[4] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_151"
[5] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_201"
[6] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_251"
[7] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_301"
[8] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_351"
[9] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_401"
[10] "https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_451"
So we can do this:
str_c("https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_",seq.int(from = 51,by = 50,length.out = 40))
Scrape each page
Let's use this page as an example: https://articulo.mercadolibre.com.uy/MLU-449598178-protector-funda-clear-cover-samsung-galaxy-note-8-_JM
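Note that the snippets below call remove_nt(), which is not defined in this post; presumably it is a small helper that strips newline/tab characters and collapses whitespace. A minimal sketch of such a helper (an assumption, not part of the original answer):
library(stringr)

# assumed helper: drop newlines/tabs and squish repeated whitespace
remove_nt <- function(x){
  str_squish(str_replace_all(x, "[\r\n\t]", " "))
}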
pagesource <- read_html("https://articulo.mercadolibre.com.uy/MLU-449598178-protector-funda-clear-cover-samsung-galaxy-note-8-_JM")
n_vendor <- pagesource %>% html_node(".item-conditions") %>% html_text() %>% remove_nt()
product_description <- pagesource %>% html_node(".item-title__primary") %>% html_text() %>% remove_nt()
n_opinion <- pagesource %>% html_node(".average-legend span:nth-child(1)") %>% html_text()
product_price <- pagesource %>% html_nodes(".price-tag-fraction") %>% html_text()
current_table <- tibble(product_description = product_description,
                        product_price = product_price,
                        n_vendor = n_vendor,
                        n_opinion = n_opinion)
print(current_table)
# A tibble: 1 x 4
  product_description                               product_price n_vendor   n_opinion
  <chr>                                             <chr>         <chr>      <chr>
1 Protector Funda Clear Cover Samsung Galaxy Note 8 14            14vendidos 2
You can loop over the code chunk above and get all the info.
Let's combine it all together
The following code should work; you can remove the 5-page limit to scrape all the product information.
library(tidyverse)
library(rvest)
#Specifying the url
url_base <- 'https://telefonia.mercadolibre.com.uy/accesorios-celulares/'
#You need to get href and loop on hrefs
all_pages <- url_base %>% read_html() %>% html_nodes(".pagination__page > a") %>% html_attr("href")
all_pages <- c(url_base,
               str_c("https://telefonia.mercadolibre.com.uy/accesorios-celulares/_Desde_",
                     seq.int(from = 51,by = 50,length.out = 40)))
#create an empty table to store results
result_table <- tibble()
for(page in all_pages[1:5]){ #as an example, only scrape the first 5 pages
  page_source <- read_html(page)
  title <- html_nodes(page_source,'.item__info-title') %>% html_text()
  price <- html_nodes(page_source,'.item__price') %>% html_text()
  item_link <- html_nodes(page_source,'.item__info-title') %>% html_attr("href")
  temp_table <- tibble(title = title, price = price, item_link = item_link)
  result_table <- bind_rows(result_table,temp_table)
}
#loop on result table(item_link):
product_table <- tibble()
for(i in 1:nrow(result_table)){
  pagesource <- read_html(result_table[[i,"item_link"]])
  n_vendor <- pagesource %>% html_node(".item-conditions") %>% html_text() %>% remove_nt()
  product_description <- pagesource %>% html_node(".item-title__primary") %>% html_text() %>% remove_nt()
  currency_symbol <- pagesource %>% html_node(".price-tag-symbol") %>% html_text()
  n_opinion <- pagesource %>% html_node(".average-legend span:nth-child(1)") %>% html_text()
  product_price <- pagesource %>% html_nodes(".price-tag-fraction") %>% html_text()
  current_table <- tibble(product_description = product_description,
                          currency_symbol = currency_symbol,
                          product_price = product_price,
                          n_vendor = n_vendor,
                          n_opinion = n_opinion,
                          item_link = result_table[[i,"item_link"]])
  product_table <- bind_rows(product_table,current_table)
}
Result:
Some issues
There are still some rough edges in the code. For example, on some item pages two elements match the CSS selector, which may break the code. There are some solutions, though:
Store the result in a list instead of a table
Use a more precise CSS selector
Concatenate the strings whenever there is more than one match
etc.
You can choose whichever method fits your requirements.
Also, if you want to scrape in quantity, you may want to use tryCatch to prevent any errors from breaking your loop.
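As a minimal sketch of that idea (not part of the original answer), you could wrap the read_html() call so a failed request is skipped instead of stopping the loop:
# wrap the request so one bad link doesn't stop the whole loop
safe_read <- function(link){
  tryCatch(read_html(link), error = function(e) NULL)
}

for(i in 1:nrow(result_table)){
  pagesource <- safe_read(result_table[[i,"item_link"]])
  if(is.null(pagesource)) next   # skip items that failed to download
  # ... extract the fields and bind rows as above ...
}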
About APIs
An API is a completely different approach from web scraping; you may want to read some tutorials about APIs if you want to go that route.

R: Using Rvest to loop through list

I am trying to scrape the prices, areas and addresses of all flats on this page (https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Sachsen/Dresden).
Getting the data for one list element with rvest and XPath works fine (see code), but I don't know how to get the ID of each list element in order to loop through all elements.
Here is a part of the HTML code with the data-go-to-expose-id I need for the loop. How can I get all the IDs?
<span class="slick-bg-layer"></span><img alt="Immobilienbild" class="gallery__image block height-full" src="https://pictures.immobilienscout24.de/listings/541dfd45-c75a-4da7-a831-3339264d578b-1193970198.jpg/ORIG/legacy_thumbnail/532x399/format/jpg/quality/80">a831-3339264d578b-1193970198.jpg/ORIG/legacy_thumbnail/532x399/format/jpg/quality/80"></a>
And here is my current R code to fetch the data from one list element:
library(rvest)
url <- "https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Sachsen/Dresden"
address <- url %>% read_html(encoding = "UTF-8") %>% html_node(xpath = '//*[@id="result-103049161"]/div[2]/div[2]/div[1]/div[2]/div[2]/a') %>% html_text()
price <- url %>% read_html(encoding = "UTF-8") %>% html_node(xpath = '//*[@id="result-103049161"]/div[2]/div[2]/div[1]/div[3]/div/div[1]/dl[1]/dd') %>% html_text()
area <- url %>% read_html(encoding = "UTF-8") %>% html_node(xpath = '//*[@id="result-103049161"]/div[2]/div[2]/div[1]/div[3]/div/div[1]/dl[2]/dd') %>% html_text()
Does this get what you are after?
library("tidyverse")
library("httr")
library("rvest")
url <- "https://www.immobilienscout24.de/Suche/S-T/P-1/Wohnung-Miete/Sachsen/Dresden"
x <- read_html(url)
x %>%
html_nodes("#listings") %>%
html_nodes(".result-list__listing") %>%
html_attr("data-id")

extract all the possible text from a webpage in R

I used this script to extract the text from a webpage:
library(RCurl)
library(XML)

url <- "http://www.dlink.com/it/it"
doc <- getURL(url)
#get the text from the body, excluding scripts, styles and noscript blocks
html <- htmlTreeParse(doc, useInternal = TRUE)
txt <- xpathApply(html, "//body//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)]", xmlValue)
txt <- toString(txt)
But the problem is that it only takes the text of that first page; how can I extend it to the whole website?
I'd go with rvest to scrape the links and purrr to iterate:
library(rvest)
library(purrr)
url <- "http://www.dlink.com/it/it"
r <- read_html(url) %>%
  html_nodes('a') %>%
  html_attr('href') %>%
  Filter(function(f) !is.na(f) & !grepl(x = f, pattern = '#|facebook|linkedin|twitter|youtube'), .) %>%
  map(~{
    print(.x)
    html_session(url) %>%
      jump_to(.x) %>%
      read_html() %>%
      html_nodes('body') %>%
      html_text() %>%
      toString()
  })
I filtered out social networks and dead links from the list of links; some more tuning might be in order.
Be advised that you will be scraping a lot of garbage. Some more targeting of what to scrape inside each page might be needed (i.e., something more specific than the whole body tag).
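One way to tune the link filtering, for example, is to resolve relative hrefs and keep only links on the same domain. A sketch (the xml2::url_absolute call is standard; the dlink.com domain check is an assumption about what you want to keep):
library(rvest)
library(purrr)
library(xml2)

url <- "http://www.dlink.com/it/it"
page <- read_html(url)

links <- page %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  discard(is.na) %>%
  url_absolute(base = url) %>%          # turn relative hrefs into full URLs
  keep(~ grepl("dlink\\.com", .x)) %>%  # keep only links on the same domain
  unique()

head(links)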
