Extract meta description from web pages using R

Extract meta description from web pages using R - r

Hi i am trying to retrive these wepages meta descriptions
from the pages sources "
Data<-data.frame(Pages=c(
"http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html",
"http://boingboing.net/2016/06/16/omg-the-japanese-trump-commer.html",
"http://boingboing.net/2016/06/16/omar-mateen-posted-to-facebook.html"))
Desired output
Data$Meta_Description<-data.frame(Extracted=c(
"Sanford Wallace gets 2.5 years in prison for 27 million Facebook",
"OMG, this Japanese Trump Commercial is everything",
"Omar Mateen posted to Facebook during Orlando mass shooting"))
I was trying to achieved this task with httr but I am not ablet to get it in the desired ouput format or extract content from what it is retrieved using the GET command
library (httr)
resp<-GET ("http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html")
str(resp)
List of 10
$ url : chr "http://boingboing.net/2016/06/16/spam-king-sanford-wallace.html"
$ status_code: int 200
$ headers :List of 22
..$ server : chr "Apache/2.2"
The field that i need to extract from the source code is after this string
<meta itemprop="description" content="
Like so
<meta itemprop="description" content="'Spam King'
Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"

You really only need rvest. Since they're all <h1> headings, you can just iterate over the list of URLs, picking out the headings:
library(rvest)
sapply(Data$Pages,
function(url){
url %>%
as.character() %>% # in case strings are stored as factors
read_html() %>%
html_nodes('h1') %>%
html_text()
})
# [1] "'Spam King' Sanford Wallace gets 2.5 years in prison for 27 million Facebook scam messages"
# [2] "OMG, this Japanese Trump Commercial is everything"
# [3] "Omar Mateen posted to Facebook during Orlando mass shooting"
Or if you really want to scrape the <meta> tags, you can do it in the same way, though the selectors are more of a pain:
sapply(Data$Pages, function(url){
url %>%
as.character() %>%
read_html() %>%
html_nodes(xpath = '//meta[#itemprop="description"]') %>%
html_attr('content')
})
You get the same results either way.

Related

How to retrieve text below titles from google search using rvest

This is a follow up question for this one:
How to retrieve titles from google search using rvest
In this time I am trying to get the text behind titles in google search (circled in red):
Due to my lack of knowledge in web design I do not know how to formulate the xpath to extract the text below titles.
The answer by #AllanCameron is very useful but I do not know how to modify it:
library(rvest)
library(tidyverse)
#Code
#url
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
#Get data
first_page <- read_html(url)
titles <- html_nodes(first_page, xpath = "//div/div/div/a/h3") %>%
html_text()
Many thanks for your help!

This can all be done without Selenium, using rvest. Unfortunately, Google works differently in different locales, so for example in my locale there is a consent page that has to be navigated before I can even send a request to Google.
It seems this is not required in the OPs locale, but for those if you in the UK, you might need to run the following code first for the rest to work:
library(rvest)
library(tidyverse)
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
google_handle <- httr::handle('https://www.google.com')
httr::GET('https://www.google.com', handle = google_handle)
httr::POST(paste0('https://consent.google.com/save?continue=',
'https://www.google.com/',
'&gl=GB&m=0&pc=shp&x=5&src=2',
'&hl=en&bl=gws_20220801-0_RC1&uxe=eomtse&',
'set_eom=false&set_aps=true&set_sc=true'),
handle = google_handle)
url <- httr::GET(url, handle = google_handle)
For the OP and those without a Google consent page, the set up is simply:
library(rvest)
library(tidyverse)
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
Next we define the xpaths we are going to use to extract the title (as in the previous Q&A), and the text below the title (pertinent to this question)
title <- "//div/div/div/a/h3"
text <- paste0(title, "/parent::a/parent::div/following-sibling::div")
Now we can just apply these xpaths to get the correct nodes and extract the text from them:
first_page <- read_html(url)
tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
text = first_page %>% html_nodes(xpath = text) %>% html_text())
#> # A tibble: 9 x 2
#> title text
#> <chr> <chr>
#> 1 "Mario García Torres - Wikipedia" "Mario García Torres (born 1975 in Monc~
#> 2 "Mario Torres (#mario_torres25) • I~ "Mario Torres. Oaxaca, México. Luz y co~
#> 3 "Mario Lopez Torres - A Furniture A~ "The Mario Lopez Torres boutiques are a~
#> 4 "Mario Torres - Player profile | Tr~ "Mario Torres. Unknown since: -. Mario ~
#> 5 "Mario García Torres | The Guggenhe~ "Mario García Torres was born in 1975 i~
#> 6 "Mario Torres - Founder - InfOhana ~ "Ve el perfil de Mario Torres en Linked~
#> 7 "3500+ \"Mario Torres\" profiles - ~ "View the profiles of professionals nam~
#> 8 "Mario Torres Lopez - 33 For Sale o~ "H 69 in. Dm 20.5 in. 1970s Tropical Vi~
#> 9 "Mario Lopez Torres's Woven River G~ "28 Jun 2022 · From grass harvesting to~

The subtext you refer to appears to be rendered in JavaScript, which makes it difficult to access with conventional read_html() methods.
Here is a script using RSelenium which gets the results necessary. You can also click the next page element if you want to get more results etc.
library(wdman)
library(RSelenium)
library(tidyverse)
selServ <- selenium(
port = 4446L,
version = 'latest',
chromever = '103.0.5060.134', # set to available
)
remDr <- remoteDriver(
remoteServerAddr = 'localhost',
port = 4446L,
browserName = 'chrome'
)
remDr$open()
remDr$navigate("insert URL here")
text_elements <- remDr$findElements("xpath", "//div/div/div/div[2]/div/span")
sapply(text_elements, function(x) x$getElementText()) %>%
unlist() %>%
as_tibble_col("results") %>%
filter(str_length(results) > 15)
# # A tibble: 6 × 1
# results
# <chr>
# 1 "The meaning of HI is —used especially as a greeting. How to use hi in a sentence."
# 2 "Hi definition, (used as an exclamation of greeting); hello! See more."
# 3 "A friendly, informal, casual greeting said upon someone's arrival. quotations ▽synonyms △. Synonyms: hello, greetings, howdy.…
# 4 "Hi is defined as a standard greeting and is short for \"hello.\" An example of \"hi\" is what you say when you see someone. i…
# 5 "hi synonyms include: hola, hello, howdy, greetings, cheerio, whats crack-a-lackin, yo, how do you do, good morrow, guten tag,…
# 6 "Meaning of hi in English ... used as an informal greeting, usually to people who you know: Hi, there! Hi, how are you doing? …

Is there a way to scrape through multiple pages on a website in R

I am new to R and webscraping. For practice I am trying to scrape book titles from a fake website that has multiple pages ('http://books.toscrape.com/catalogue/page-1.html'), and then calculate certain metrics based on the book titles. There are 20 books on each page and 50 pages, I have managed to scrape and calculate metrics for the first 20 books, however I want to calculate the metrics for the full 1000 books on the website.
The current output looks like this:
[1] "A Light in the Attic"
[2] "Tipping the Velvet"
[3] "Soumission"
[4] "Sharp Objects"
[5] "Sapiens: A Brief History of Humankind"
[6] "The Requiem Red"
[7] "The Dirty Little Secrets of Getting Your Dream Job"
[8] "The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull"
[9] "The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics"
[10] "The Black Maria"
[11] "Starving Hearts (Triangular Trade Trilogy, #1)"
[12] "Shakespeare's Sonnets"
[13] "Set Me Free"
[14] "Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)"
[15] "Rip it Up and Start Again"
[16] "Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991"
[17] "Olio"
[18] "Mesaerion: The Best Science Fiction Stories 1800-1849"
[19] "Libertarianism for Beginners"
[20] "It's Only the Himalayas"
I want this to be 1000 books long instead of 20, this will allow me to use the same code to calculate the metrics but for 1000 books instead of 20.
Code:
url<-'http://books.toscrape.com/catalogue/page-1.html'
url %>%
read_html() %>%
html_nodes('h3 a') %>%
html_attr('title')->titles
titles
What would be the best way to scrape every book from the website and make the list 1000 book titles long instead of 20? Thanks in advance.

Generate the 50 URLs, then iterate on them, e.g. with purrr::map
library(rvest)
urls <- paste0('http://books.toscrape.com/catalogue/page-', 1:50, '.html')
titles <- purrr::map(
urls,
. %>%
read_html() %>%
html_nodes('h3 a') %>%
html_attr('title')
)

something like this perhaps?
library(tidyverse)
library(rvest)
library(data.table)
# Vector with URL's to scrape
url <- paste0("http://books.toscrape.com/catalogue/page-", 1:20, ".html")
# Scrape to list
L <- lapply( url, function(x) {
print( paste0( "scraping: ", x, " ... " ) )
data.table(titles = read_html(x) %>%
html_nodes('h3 a') %>%
html_attr('title') )
})
# Bind list to single data.table
data.table::rbindlist(L, use.names = TRUE, fill = TRUE)

R: readLines on a URL leads to missing lines

When I readLines() on an URL, I get missing lines or values. This might be due to spacing that the computer can't read.
When you use the URL above, CTR + F finds 38 instances of text that matches "TV-". On the other hand, when I run readLines() and grep("TV-", HTML) I only find 12.
So, how can I avoid encoding/ spacing errors so that I can get complete lines of the HTML?

You can use rvest to scrape the data. For example, to get all the titles you can do :
library(rvest)
url <- 'https://www.imdb.com/search/title/?locations=Vancouver,%20British%20Columbia,%20Canada&start=1.json'
url %>%
read_html() %>%
html_nodes('div.lister-item-content h3 a') %>%
html_text() -> all_titles
all_titles
# [1] "The Haunting of Bly Manor" "The Haunting of Hill House"
# [3] "Supernatural" "Helstrom"
# [5] "The 100" "Lucifer"
# [7] "Criminal Minds" "Fear the Walking Dead"
# [9] "A Babysitter's Guide to Monster Hunting" "The Stand"
#...
#...

Rvest: Headlines returning empty list

I'm trying to replicate this tutorial on rvest here. However, at the start I'm already having issues. This is the code I'm using
library(rvest)
#Specifying the url for desired website to be scrapped
url <- 'https://www.nytimes.com/section/politics'
#Reading the HTML code from the website - headlines
webpage <- read_html(url)
headline_data <- html_nodes(webpage,'.story-link a, .story-body a')
My results when I look at headline_data return
{xml_nodeset (0)}
But in the tutorial it returns a list of length 48
{xml_nodeset (48)}
Any reason for the discrepancy?

As mentioned in the comments, there are no elements with the specified class you are searching for.
To begin, based on current tags you can get headlines with
library(rvest)
library(dplyr)
url <- 'https://www.nytimes.com/section/politics'
url %>%
read_html() %>%
html_nodes("h2.css-l2vidh a") %>%
html_text()
#[1] "Trump’s Secrecy Fight Escalates as Judge Rules for Congress in Early Test"
#[2] "A Would-Be Trump Aide’s Demands: A Jet on Call, a Future Cabinet Post and More"
#[3] "He’s One of the Biggest Backers of Trump’s Push to Protect American Steel. And He’s Canadian."
#[4] "Accountants Must Turn Over Trump’s Financial Records, Lower-Court Judge Rules"
and to get individual URL's of those headlines you could do
url %>%
read_html() %>%
html_nodes("h2.css-l2vidh a") %>%
html_attr("href") %>%
paste0("https://www.nytimes.com", .)
#[1] "https://www.nytimes.com/2019/05/20/us/politics/mcgahn-trump-congress.html"
#[2] "https://www.nytimes.com/2019/05/20/us/politics/kris-kobach-trump.html"
#[3] "https://www.nytimes.com/2019/05/20/us/politics/hes-one-of-the-biggest-backers-of-trumps-push-to-protect-american-steel-and-hes-canadian.html"
#[4] "https://www.nytimes.com/2019/05/20/us/politics/trump-financial-records.html"

Web scraping with R and selector gadget

I am trying to scrape data from a website using R. I am using rvest in an attempt to mimic an example scraping the IMDB page for the Lego Movie. The example advocates use of a tool called Selector Gadget to help easily identify the html_node associated with the data you are seeking to pull.
I am ultimately interested in building a data frame that has the following schema/columns:
rank, blog_name, facebook_fans, twitter_followers, alexa_rank.
My code below. I was able to use Selector Gadget to correctly identity the html tag used in the Lego example. However, following the same process and same code structure as the Lego example, I get NAs (...using firstNAs introduced by coercion[1] NA
). My code is below:
data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
html_node(".stats") %>%
html_text() %>%
as.numeric()
I have also experimented with: html_node("html_node(".stats , .stats span")), which seems to work for the "Facebook fans" column since it reports 714 matches, however only returns 1 number is returned.
714 matches for .//*[#class and contains(concat(' ', normalize-space(#class), ' '), ' stats ')] | .//*[#class and contains(concat(' ', normalize-space(#class), ' '), ' stats ')]/descendant-or-self::*/span: using first{xml_node}
<td>
[1] <span>997,669</span>

This may help you:
library(rvest)
d1 <- read_html("http://blog.feedspot.com/video_game_news/")
stats <- d1 %>%
html_nodes(".stats") %>%
html_text()
blogname <- d1%>%
html_nodes(".tlink") %>%
html_text()
Note that it is html_nodes (plural)
Result:
> head(blogname)
[1] "Kotaku - The Gamer's Guide" "IGN | Video Games" "Xbox Wire" "Official PlayStation Blog"
[5] "Nintendo Life " "Game Informer"
> head(stats,12)
[1] "997,669" "1,209,029" "873" "4,070,476" "4,493,805" "399" "23,141,452" "10,210,993" "879"
[10] "38,019,811" "12,059,607" "500"
blogname returns the list of blog names that is easy to manage. On the other hand the stats info comes out mixed. This is due to the way the stats class for Facebook and Twitter fans are indistinguishable from one another. In this case the output array has the information every three numbers, that is stats = c(fb, tw, alx, fb, tw, alx...). You should separate each vector from this one.
FBstats = stats[seq(1,length(stats),3)]
> head(stats[seq(1,length(stats),3)])
[1] "997,669" "4,070,476" "23,141,452" "38,019,811" "35,977" "603,681"

You can use html_table to extract the whole table with minimal work:
library(rvest)
library(tidyverse)
# scrape html
h <- 'http://blog.feedspot.com/video_game_news/' %>% read_html()
game_blogs <- h %>%
html_node('table') %>% # select enclosing table node
html_table() %>% # turn table into data.frame
set_names(make.names) %>% # make names syntactic
mutate(Blog.Name = sub('\\s?\\+.*', '', Blog.Name)) %>% # extract title from name info
mutate_at(3:5, parse_number) %>% # make numbers actually numbers
tbl_df() # for printing
game_blogs
#> # A tibble: 119 x 5
#> Rank Blog.Name Facebook.Fans Twitter.Followers Alexa.Rank
#> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 Kotaku - The Gamer's Guide 997669 1209029 873
#> 2 2 IGN | Video Games 4070476 4493805 399
#> 3 3 Xbox Wire 23141452 10210993 879
#> 4 4 Official PlayStation Blog 38019811 12059607 500
#> 5 5 Nintendo Life 35977 95044 17727
#> 6 6 Game Informer 603681 1770812 10057
#> 7 7 Reddit | Gamers 1003705 430017 25
#> 8 8 Polygon 623808 485827 1594
#> 9 9 Xbox Live's Major Nelson 65905 993481 23114
#> 10 10 VG247 397798 202084 3960
#> # ... with 109 more rows
It's worth checking that everything is parsed like you want, but it should be usable at this point.

This uses html_nodes (plural) and str_replace to remove commas in numbers. Not sure if these are all the stats you need.
library(rvest)
library(stringr)
data2_html = read_html("http://blog.feedspot.com/video_game_news/")
data2_html %>%
html_nodes(".stats") %>%
html_text() %>%
str_replace_all(',', '') %>%
as.numeric()

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Extract meta description from web pages using R - r

Related

How to retrieve text below titles from google search using rvest

Is there a way to scrape through multiple pages on a website in R

R: readLines on a URL leads to missing lines

Rvest: Headlines returning empty list

Web scraping with R and selector gadget

Categories

Resources