Attempting to scrape an "unscrapable" page? - r

I'm attempting to build a simple scraper, iterating through a website to pull two pieces of information and build myself a little reference list.
This is what the url looks like: "https://www.mtgstocks.com/prints/[[n]]"
The two pieces of information are the card name (Forbidden Alchemy) and card set (Innistrad).
Pretty straightforward, yeah? I thought so.
I attempted to pass every relevant anchor (CSS or XPath) to try and isolate the two variables, but was met with "{xml_nodeset (0)}".
Here's the code that I ran:
# return page info
page_html <- read_html(httr::GET("https://www.mtgstocks.com/prints/1"))

# extract item name
page_html %>%
  html_nodes("h3") %>%
  html_nodes("a") %>%
  html_text()
# character(0)
I've scraped enough webpages to know that this information is being hidden, but I'm not exactly sure how. Would love help!
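A quick way to verify that the content really isn't in the served HTML is to search the raw response for a string you can see in the browser. Here is a sketch using httr; "Forbidden Alchemy" is the card name visible on print 1:

```r
library(httr)

# Fetch the raw HTML and look for the card name anywhere in it
raw <- content(GET("https://www.mtgstocks.com/prints/1"), as = "text")
grepl("Forbidden Alchemy", raw, fixed = TRUE)
# FALSE here would confirm the name is injected client-side (e.g. by JavaScript)
```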

They are gathering the information from their API, which you can see in the Network tab of the developer tools. There is a ton of card info you can gather from that API - take a look.
library(tidyverse)
library(httr2)

get_card <- function(card_num) {
  cat("Scraping card num:", card_num, "\n")
  data <- str_c("https://api.mtgstocks.com/prints/", card_num) %>%
    request() %>%
    req_perform() %>%
    resp_body_json(simplifyVector = TRUE)
  tibble(
    name = data %>%
      pluck("name"),
    set = data %>%
      pluck("card_set") %>%
      pluck("name")
  )
}

get_card(1)
# A tibble: 1 × 2
  name              set
  <chr>             <chr>
1 Forbidden Alchemy Innistrad
Or fetch a whole range of numbers. If a card does not exist, the code returns NA.
map_dfr(1:20, possibly(get_card, otherwise = tibble(
  name = NA_character_,
  set = NA_character_
)))
# A tibble: 20 × 2
   name                 set
   <chr>                <chr>
 1 Forbidden Alchemy    Innistrad
 2 NA                   NA
 3 Fortress Crab        Innistrad
 4 Frightful Delusion   Innistrad
 5 Full Moon's Rise     Innistrad
 6 Furor of the Bitten  Innistrad
 7 Gallows Warden       Innistrad
 8 Galvanic Juggernaut  Innistrad
 9 Garruk Relentless    Innistrad
10 Gatstaf Shepherd     Innistrad
11 Gavony Township      Innistrad
12 Geist of Saint Traft Innistrad
13 Geist-Honored Monk   Innistrad
14 Geistcatcher's Rig   Innistrad
15 Geistflame           Innistrad
16 Ghost Quarter        Innistrad
17 Ghostly Possession   Innistrad
18 Ghoulcaller's Bell   Innistrad
19 Ghoulcaller's Chant  Innistrad
20 Ghoulraiser          Innistrad

Related

Scraping movie scripts failing on small subset

I'm working on scraping the Lord of the Rings movie scripts from this website here. Each script is broken up across multiple pages that look like this.
I can get the info I need for a single page with this code:
library(dplyr)
library(rvest)
url_success <- "http://www.ageofthering.com/atthemovies/scripts/fellowshipofthering1to4.php"
success <- read_html(url_success) %>%
  html_elements("#AutoNumber1") %>%
  html_table()

summary(success)
#      Length Class  Mode
# [1,] 2      tbl_df list
This works for all Fellowship of the Ring pages and all Return of the King pages. It also works for Two Towers pages covering scenes 57 to 66. However, any other Two Towers page (scenes 1-56) does not return the same result:
url_fail <- "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"
fail <- read_html(url_fail) %>%
  html_elements("#AutoNumber1") %>%
  html_table()

summary(fail)
# Length Class Mode
# 0      list  list
I've inspected the pages in Chrome, and the failing pages appear to have the same structure as the succeeding ones, including the 'AutoNumber1' table. Can anyone help with this?
Works with XPath. Perhaps the HTML is ill-formed (the page doesn't seem too spec-compliant):
library(rvest)

url_fail <- "http://www.ageofthering.com/atthemovies/scripts/thetwotowers1to4.php"
fail <- read_html(url_fail) %>%
  html_elements(xpath = '//*[@id="AutoNumber1"]') %>%
  html_table()
fail
#> [[1]]
#> # A tibble: 139 × 2
#> X1 X2
#> <chr> <chr>
#> 1 "Scene 1 ~ The Foundations of Stone\r\n\r\n\r\nThe movie opens as the … "Sce…
#> 2 "GANDALF VOICE OVER:" "You…
#> 3 "FRODO VOICE OVER:" "Gan…
#> 4 "GANDALF VOICE OVER:" "I a…
#> 5 "The scene changes to \r\n inside Moria.  Gandalf is on the Bridge … "The…
#> 6 "GANDALF:" "You…
#> 7 "Gandalf slams down his staff onto the Bridge, \r\ncausing it to crack… "Gan…
#> 8 "BOROMIR :" "(ho…
#> 9 "FRODO:" "Gan…
#> 10 "GANDALF:" "Fly…
#> # … with 129 more rows

Scraping a web page in R without using RSelenium

I'm trying to do a simple scrape of the table at the following URL:
https://www.bcb.gov.br/controleinflacao/historicometas
What I notice is that when using rvest::read_html or httr::GET, and when looking at the page source code, I can't see anything related to the table; but when opening the Google Chrome developer tools, I can spot the table references in the Elements tab.
Below is a simple example where I try to access the content of the URL and search for nodes that contain tables:
library(tidyverse)
library(rvest)

url <- "https://www.bcb.gov.br/controleinflacao/historicometas"
res <- url %>%
  read_html() %>%
  html_node("table")

This gives me:
{xml_nodeset (0)}
Opening the source code of the URL mentioned, we can see:
view-source:https://www.bcb.gov.br/controleinflacao/historicometas
Page Source Code print
Page Developer Tool table print
From what I have searched, the scripts available in the source code load the table. I have seen some solutions that use RSelenium, but I would like to know if there is some way to scrape this table without using RSelenium.
Some other related StackOverflow questions:
Scraping webpage (with R) where all elements are placed inside an <app-root> tag
scraping table from a website result as empty
(Last one is a python example)
When dealing with dynamic sites, the Network tab tends to be more useful than the Inspector. And often you don't have to scroll through hundreds of requests or pages of minified JavaScript; instead, pick a search term from the rendered page to identify the API endpoint that sent that piece of information.
In this case, searching for "Resolução CMN nº 2.615" pointed to the correct call; most of the site content (as plain HTML) was delivered as JSON.
library(tibble)
library(rvest)

historicometas <- jsonlite::read_json("https://www.bcb.gov.br/api/paginasite/sitebcb/controleinflacao/historicometas")

historicometas$conteudo %>%
  read_html() %>%
  html_element("table") %>%
  html_table()
#> # A tibble: 27 × 7
#> Ano Norma Data Meta …¹ Taman…² Inter…³ Infla…⁴
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 1999 Resolução CMN nº 2.615 ​ 30/6… 8 2 6-10 8,94
#> 2 2000 Resolução CMN nº 2.615 ​ 30/6… ​6 ​2 4-8 5,97
#> 3 2001 Resolução CMN nº 2.615 ​ 30/6… ​4 ​2 2-6 7,67
#> 4 2002 Resolução CMN nº 2.744 28/6… 3,5 2 1,5-5,5 12,53
#> 5 2003* Resolução CMN nº 2.842Resolução … 28/6… 3,254 22,5 1,25-5… 9,309,…
#> 6 2004* Resolução CMN nº 2.972Resolução … 27/6… 3,755,5 2,52,5 1,25-6… 7,60
#> 7 2005 Resolução CMN nº 3.108 25/6… 4,5 2,5 2-7 5,69
#> 8 2006 Resolução CMN nº 3.210 30/6… 4,5 ​2,0 2,5-6,5 3,14
#> 9 2007 Resolução CMN nº 3.291 23/6… 4,5 ​2,0 2,5-6,5 4,46
#> 10 2008 Resolução CMN nº 3.378 29/6… 4,5 ​2,0 2,5-6,5 5,90
#> # … with 17 more rows, and abbreviated variable names ¹​`Meta (%)`,
#> # ²​`Tamanhodo intervalo +/- (p.p.)`, ³​`Intervalode tolerância (%)`,
#> # ⁴​`Inflação efetiva(Variação do IPCA, %)`
Created on 2022-10-17 with reprex v2.0.2

How to retrieve text below titles from google search using rvest

This is a follow-up question to this one:
How to retrieve titles from google search using rvest
This time I am trying to get the text behind the titles in a Google search (circled in red):
Due to my lack of knowledge in web design, I do not know how to formulate the XPath to extract the text below the titles.
The answer by @AllanCameron is very useful, but I do not know how to modify it:
library(rvest)
library(tidyverse)

# url
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'

# Get data
first_page <- read_html(url)

titles <- html_nodes(first_page, xpath = "//div/div/div/a/h3") %>%
  html_text()
Many thanks for your help!
This can all be done without Selenium, using rvest. Unfortunately, Google works differently in different locales, so for example in my locale there is a consent page that has to be navigated before I can even send a request to Google.
It seems this is not required in the OP's locale, but for those of you in the UK, you might need to run the following code first for the rest to work:
library(rvest)
library(tidyverse)

url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'

google_handle <- httr::handle('https://www.google.com')

httr::GET('https://www.google.com', handle = google_handle)

httr::POST(paste0('https://consent.google.com/save?continue=',
                  'https://www.google.com/',
                  '&gl=GB&m=0&pc=shp&x=5&src=2',
                  '&hl=en&bl=gws_20220801-0_RC1&uxe=eomtse&',
                  'set_eom=false&set_aps=true&set_sc=true'),
           handle = google_handle)

url <- httr::GET(url, handle = google_handle)
For the OP and those without a Google consent page, the set up is simply:
library(rvest)
library(tidyverse)
url <- 'https://www.google.com/search?q=Mario+Torres+Mexico'
Next we define the XPaths we are going to use to extract the title (as in the previous Q&A) and the text below the title (pertinent to this question):
title <- "//div/div/div/a/h3"
text <- paste0(title, "/parent::a/parent::div/following-sibling::div")
Now we can just apply these xpaths to get the correct nodes and extract the text from them:
first_page <- read_html(url)

tibble(title = first_page %>% html_nodes(xpath = title) %>% html_text(),
       text  = first_page %>% html_nodes(xpath = text) %>% html_text())
#> # A tibble: 9 x 2
#> title text
#> <chr> <chr>
#> 1 "Mario García Torres - Wikipedia" "Mario García Torres (born 1975 in Monc~
#> 2 "Mario Torres (#mario_torres25) • I~ "Mario Torres. Oaxaca, México. Luz y co~
#> 3 "Mario Lopez Torres - A Furniture A~ "The Mario Lopez Torres boutiques are a~
#> 4 "Mario Torres - Player profile | Tr~ "Mario Torres. Unknown since: -. Mario ~
#> 5 "Mario García Torres | The Guggenhe~ "Mario García Torres was born in 1975 i~
#> 6 "Mario Torres - Founder - InfOhana ~ "Ve el perfil de Mario Torres en Linked~
#> 7 "3500+ \"Mario Torres\" profiles - ~ "View the profiles of professionals nam~
#> 8 "Mario Torres Lopez - 33 For Sale o~ "H 69 in. Dm 20.5 in. 1970s Tropical Vi~
#> 9 "Mario Lopez Torres's Woven River G~ "28 Jun 2022 · From grass harvesting to~
The subtext you refer to appears to be rendered in JavaScript, which makes it difficult to access with conventional read_html() methods.
Here is a script using RSelenium which gets the results necessary. You can also click the next page element if you want to get more results etc.
library(wdman)
library(RSelenium)
library(tidyverse)

selServ <- selenium(
  port = 4446L,
  version = 'latest',
  chromever = '103.0.5060.134' # set to an available version
)
remDr <- remoteDriver(
  remoteServerAddr = 'localhost',
  port = 4446L,
  browserName = 'chrome'
)

remDr$open()
remDr$navigate("insert URL here")

text_elements <- remDr$findElements("xpath", "//div/div/div/div[2]/div/span")

sapply(text_elements, function(x) x$getElementText()) %>%
  unlist() %>%
  as_tibble_col("results") %>%
  filter(str_length(results) > 15)
# # A tibble: 6 × 1
# results
# <chr>
# 1 "The meaning of HI is —used especially as a greeting. How to use hi in a sentence."
# 2 "Hi definition, (used as an exclamation of greeting); hello! See more."
# 3 "A friendly, informal, casual greeting said upon someone's arrival. quotations ▽synonyms △. Synonyms: hello, greetings, howdy.…
# 4 "Hi is defined as a standard greeting and is short for \"hello.\" An example of \"hi\" is what you say when you see someone. i…
# 5 "hi synonyms include: hola, hello, howdy, greetings, cheerio, whats crack-a-lackin, yo, how do you do, good morrow, guten tag,…
# 6 "Meaning of hi in English ... used as an informal greeting, usually to people who you know: Hi, there! Hi, how are you doing? …
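As mentioned, you can click the next-page element for more results. A minimal sketch of that step (assuming Google's "Next" link still uses the id pnnext, which may vary by locale):

```r
# Click Google's "Next" link, wait for the page, then re-collect the spans
next_btn <- remDr$findElement("css selector", "#pnnext")
next_btn$clickElement()
Sys.sleep(2)  # crude wait; give the next page time to render
text_elements <- remDr$findElements("xpath", "//div/div/div/div[2]/div/span")
```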

Webscraping and downloading PDFs in R

I'm trying to loop through different pages of this website https://burnsville.civicweb.net/filepro/documents/25657/ and download all the PDFs to a folder. Because of the way the website is set up, my usual download.file solution won't work. Any other suggestions?
You probably have found a solution by now, but here is my suggestion using rvest and purrr's approach to looping. This should work across the Burnsville database; just replace the page variable.
library(tidyverse)
library(rvest)

page <-
  "https://burnsville.civicweb.net/filepro/documents/25657/" %>%
  read_html()

df <- tibble(
  names = page %>%
    html_nodes(".document-link") %>%
    html_text2() %>%
    str_remove_all("\r") %>%
    str_squish(),
  links = page %>%
    html_nodes(".document-link") %>%
    html_attr("href") %>%
    paste0("https://burnsville.civicweb.net/", .)
)
# A tibble: 6 × 2
names links
<chr> <chr>
1 Parks & Natural Resources Commission - 06 Dec 2021 Work Session - M… http…
2 Parks & Natural Resources Commission - 15 Nov 2021 - Minutes - Pdf http…
3 Parks & Natural Resources Commission - 04 Oct 2021 - Minutes - Pdf http…
4 Parks & Natural Resources Commission - 07 Jun 2021 - Minutes - Pdf http…
5 Parks & Natural Resources Commission - 19 Apr 2021 - Minutes - Pdf http…
6 Parks & Natural Resources Commission - 04 Jan 2021 - Minutes - Pdf http…
# download each PDF, pairing every link with its cleaned-up name
walk2(df$links, df$names,
      ~ download.file(.x, destfile = paste0(.y, ".pdf"), mode = "wb"))
This worked for me
download.file("https://burnsville.civicweb.net/filepro/documents/36906", "a1.pdf", mode="wb")

Web scraping with R and selector gadget

I am trying to scrape data from a website using R. I am using rvest in an attempt to mimic an example scraping the IMDB page for the Lego Movie. The example advocates use of a tool called Selector Gadget to help easily identify the html_node associated with the data you are seeking to pull.
I am ultimately interested in building a data frame that has the following schema/columns:
rank, blog_name, facebook_fans, twitter_followers, alexa_rank.
I was able to use Selector Gadget to correctly identify the html tag used in the Lego example. However, following the same process and the same code structure as the Lego example, I get NAs (the console warns "NAs introduced by coercion" and returns [1] NA). My code is below:
data2_html = read_html("http://blog.feedspot.com/video_game_news/")

data2_html %>%
  html_node(".stats") %>%
  html_text() %>%
  as.numeric()
I have also experimented with html_node(".stats , .stats span"), which seems to work for the "Facebook fans" column since it reports 714 matches; however, only 1 number is returned:
714 matches for .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')] | .//*[@class and contains(concat(' ', normalize-space(@class), ' '), ' stats ')]/descendant-or-self::*/span: using first
{xml_node}
<td>
[1] <span>997,669</span>
This may help you:
library(rvest)

d1 <- read_html("http://blog.feedspot.com/video_game_news/")

stats <- d1 %>%
  html_nodes(".stats") %>%
  html_text()

blogname <- d1 %>%
  html_nodes(".tlink") %>%
  html_text()
Note that it is html_nodes (plural)
Result:
> head(blogname)
[1] "Kotaku - The Gamer's Guide" "IGN | Video Games" "Xbox Wire" "Official PlayStation Blog"
[5] "Nintendo Life " "Game Informer"
> head(stats,12)
[1] "997,669" "1,209,029" "873" "4,070,476" "4,493,805" "399" "23,141,452" "10,210,993" "879"
[10] "38,019,811" "12,059,607" "500"
blogname returns the list of blog names, which is easy to manage. The stats info, on the other hand, comes out mixed: the stats class makes the Facebook, Twitter and Alexa figures indistinguishable from one another, so the output vector repeats every three numbers, that is stats = c(fb, tw, alx, fb, tw, alx, ...). You have to separate each vector out of this one:
FBstats = stats[seq(1, length(stats), 3)]

> head(stats[seq(1, length(stats), 3)])
[1] "997,669"    "4,070,476"  "23,141,452" "38,019,811" "35,977"     "603,681"
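An alternative to picking every third element by hand is to reshape the interleaved vector into a three-column matrix. This is a sketch, assuming length(stats) is a multiple of three and the fb/tw/alexa order holds throughout:

```r
# One row per blog, one column per stat
stats_df <- as.data.frame(
  matrix(stats, ncol = 3, byrow = TRUE,
         dimnames = list(NULL, c("facebook_fans", "twitter_followers", "alexa_rank"))),
  stringsAsFactors = FALSE
)
head(stats_df)
```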
You can use html_table to extract the whole table with minimal work:
library(rvest)
library(tidyverse)
# scrape html
h <- 'http://blog.feedspot.com/video_game_news/' %>% read_html()
game_blogs <- h %>%
  html_node('table') %>%      # select enclosing table node
  html_table() %>%            # turn table into data.frame
  set_names(make.names) %>%   # make names syntactic
  mutate(Blog.Name = sub('\\s?\\+.*', '', Blog.Name)) %>%   # extract title from name info
  mutate_at(3:5, parse_number) %>%   # make numbers actually numbers
  tbl_df()   # for printing

game_blogs
#> # A tibble: 119 x 5
#> Rank Blog.Name Facebook.Fans Twitter.Followers Alexa.Rank
#> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 Kotaku - The Gamer's Guide 997669 1209029 873
#> 2 2 IGN | Video Games 4070476 4493805 399
#> 3 3 Xbox Wire 23141452 10210993 879
#> 4 4 Official PlayStation Blog 38019811 12059607 500
#> 5 5 Nintendo Life 35977 95044 17727
#> 6 6 Game Informer 603681 1770812 10057
#> 7 7 Reddit | Gamers 1003705 430017 25
#> 8 8 Polygon 623808 485827 1594
#> 9 9 Xbox Live's Major Nelson 65905 993481 23114
#> 10 10 VG247 397798 202084 3960
#> # ... with 109 more rows
It's worth checking that everything is parsed like you want, but it should be usable at this point.
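A quick sanity check along those lines (a sketch; the column names are taken from the printed tibble above):

```r
# Confirm the numeric columns parsed and no rows were silently dropped
stopifnot(
  is.numeric(game_blogs$Facebook.Fans),
  is.numeric(game_blogs$Alexa.Rank),
  nrow(game_blogs) > 0
)
summary(game_blogs$Facebook.Fans)
```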
This uses html_nodes (plural) and str_replace to remove commas in numbers. Not sure if these are all the stats you need.
library(rvest)
library(stringr)

data2_html = read_html("http://blog.feedspot.com/video_game_news/")

data2_html %>%
  html_nodes(".stats") %>%
  html_text() %>%
  str_replace_all(',', '') %>%
  as.numeric()