Yahoo Finance - Web scraping with R - SelectorGadget doesn't work

I am trying to get the 1,980 matching stocks from this Yahoo Finance screener:
[https://finance.yahoo.com/screener/unsaved/38a77251-0996-439b-8be4-9d10ff18ff79?count=25&offset=0]
using R and rvest.
I normally use XPath, but I can't get it with SelectorGadget on this website.
Could somebody suggest an alternative way to get all pages with those data?
I would like code similar to the one below, which worked with Investing. Please note that the Symbol, Name, and MarketCap selectors are just examples:
library(rvest)
library(dplyr)

i <- 0
for (z in 1:80) {
  # page through the screener 25 rows at a time via the offset parameter
  url_base <- paste0("https://finance.yahoo.com/screener/unsaved/38a77251-0996-439b-8be4-9d10ff18ff79?count=25&offset=", (z - 1) * 25)
  zpg <- read_html(url_base)
  Symbol    <- zpg %>% html_nodes("table") %>% html_nodes("span") %>% html_attr("data-id")
  Name      <- zpg %>% html_nodes("table") %>% html_nodes("span") %>% html_attr("data-name")
  MarketCap <- zpg %>% html_nodes("table") %>% html_nodes("span") %>% html_attr("data-name") # placeholder attribute, see note above
  data <- data.frame(Symbol, Name, MarketCap)
  if (i == 0) {
    USA <- data
  } else {
    USA <- rbind(USA, data)
  }
  i <- i + 1
}

You could try using quantmod or tidyquant.
library(tidyverse)
library(tidyquant)

# getting symbols for NASDAQ-listed companies
nasdaq <- read_delim("https://nasdaqtrader.com/dynamic/SymDir/nasdaqlisted.txt", delim = "|")

# scraping the data
df <- nasdaq %>%
  head() %>% # to fetch only a few rows
  rowwise() %>%
  mutate(data = list(tq_get(Symbol, from = '2020-08-01', to = "2020-08-07", warnings = FALSE)))

# getting the data ready
df2 <- df$data %>%
  bind_rows()
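For reference, here is a small sketch of what a single tq_get() call returns with its default get = "stock.prices" source; this is the data being nested in the data column above (the AAPL ticker is just illustrative):

library(tidyquant)

# daily prices for one ticker over the same window used above
aapl <- tq_get("AAPL", from = "2020-08-01", to = "2020-08-07")
head(aapl)
# typically returns symbol, date, open, high, low, close, volume and adjusted columns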

Related

Error in xml_nodeset(NextMethod()) : Expecting an external pointer: [type=NULL] when scraping with RVEST

I am having a problem when trying to scrape some data. I have created a function that works properly; the problem occurs when I run this function for many different codes.
require ("rvest")
library("dplyr")
getFin = function(ticker)
{
url= paste0("https://it.finance.yahoo.com/quote/",ticker,
"/key-statistics?p=",ticker)
a <- read_html(url)
tbl= a %>% html_nodes("section") %>% html_nodes("div")%>% html_nodes("table")
misureval = tbl %>% .[1] %>% html_table() %>% as.data.frame()
prezzistorici = tbl %>% .[2] %>% html_table() %>% as.data.frame()
titolistat = tbl %>% .[3] %>% html_table() %>% as.data.frame()
dividendi = tbl %>% .[4] %>% html_table() %>% as.data.frame()
annofiscale = tbl %>% .[5] %>% html_table() %>% as.data.frame()
redditivita = tbl %>% .[6] %>% html_table() %>% as.data.frame()
gestione = tbl %>% .[7] %>% html_table() %>% as.data.frame()
contoeco = tbl %>% .[8] %>% html_table() %>% as.data.frame()
bilancio = tbl %>% .[9] %>% html_table() %>% as.data.frame()
flussi = tbl %>% .[10] %>% html_table() %>% as.data.frame()
info1 = rbind(ticker, misureval, prezzistorici, titolistat, dividendi, annofiscale, redditivita, gestione, contoeco, bilancio, flussi)
}
What I am trying to do is to use
finale <- lapply(codici, getFin)
where codici is a vector of many different tickers, each of which is used in the function to generate one URL at a time and scrape the data.
I have tried with 50 tickers and the function works properly; however, when I increase the number I get this error:
Error in xml_nodeset(NextMethod()) : Expecting an external pointer: [type=NULL]
I don't know if this may be related to the number of requests or something else. I have also tested a non-existent ticker and the function still works; the problem only arises when the number of tickers is large.
Problem solved: I just needed to add a Sys.sleep() call to reduce the frequency of requests.
The best value in this case is 3 seconds, so Sys.sleep(3) at the end of each iteration.
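For clarity, a minimal sketch of where that pause could go, keeping the rest of getFin exactly as above (the 3-second value is just the one that worked here):

getFin <- function(ticker) {
  # ... same scraping code as above ...
  Sys.sleep(3)  # pause before returning so successive requests are spaced out
  info1
}
finale <- lapply(codici, getFin)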

Extracting repeated class with rvest html_elements in R

I am trying to extract some info from this sports betting webpage using rvest. I asked a related question a few days ago and reached almost 100% of my goals. So far, thanks to that answer, I have successfully extracted the title, the score, and the time of the matches being played using the following code:
library(rvest)
library(tidyverse)

page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>%
  read_html()

data <- data.frame(
  Titulo = page %>%
    html_elements(".titulo") %>%
    html_text(),
  Marcador = page %>%
    html_elements(".marcador") %>%
    html_text(),
  Tiempo = page %>%
    html_elements(".marcador+ span") %>%
    html_text() %>%
    str_squish()
)
Now I want to get the repeated values. For example, if the country of a match is "Brasil", I want the data frame to record that the country is Brasil for every match in that category. So far I have only managed to extract all the countries individually. The same applies to the sport name and the tournament.
Can you help me with that? Thanks in advance.
You could re-write your code to use separate functions that work with different levels of information. These can be called in a nested fashion, making the code easier to read.
Essentially, this uses nested map_dfr() calls to produce a single data frame from functions working with lists at different levels within the DOM.
Below, you can think of it as an outer loop over sports, an intermediate loop over countries, and an innermost loop over events within a sport and country.
library(rvest)
library(tidyverse)

get_sport_info <- function(sport) {
  df <- map_dfr(sport %>% html_elements(".category"), get_play_info)
  df$sport <- sport %>%
    html_element(".sport-name") %>%
    html_text()
  return(df)
}

get_play_info <- function(play) {
  df <- map_dfr(play %>% html_elements(".event"), ~
    data.frame(
      titulo = .x %>% html_element(".titulo") %>% html_text(),
      marcador = .x %>% html_element(".marcador") %>% html_text(),
      tiempo = .x %>% html_element(".marcador + span") %>% html_text() %>% str_squish()
    ))
  df$country <- play %>%
    html_element(".category-name") %>%
    html_text()
  return(df)
}

page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>% read_html()
sports <- page %>% html_elements(".sport")
final <- map_dfr(sports, get_sport_info)
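The resulting final data frame has one row per event, with the repeated country and sport values attached to every row, which is the behaviour asked for above:

# inspect the combined result; expected columns are titulo, marcador, tiempo, country and sport
glimpse(final)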

Add a purrr::slowly or Sys.sleep in a map

I created a function to scrape the fbref.com website. Sometimes on this website, or on others I'm trying to scrape, I receive a timeout error. I read about it, and it is suggested to include a Sys.sleep between the requests, or a purrr::slowly. I tried to include it inside the map but could not. How can I include a 10-second gap between each request inside the map (there would be 7 requests and 6 intervals of 10 seconds)?
Thanks in advance, and if I have not included something please let me know!
#packages
library(rvest)
library(stringr)
library(dplyr)
library(tidyr)
library(purrr)
library(lubridate)
library(tm)

#function
funcao_extract <- function(link1, link2) {
  fbref <- "https://fbref.com/pt/equipes/"
  dados <- fbref %>%
    map2_chr(rep(link1, 7), paste0) %>%
    map2_chr(seq(from = 2014, to = 2020), paste0) %>%
    map2_chr(rep(link2, 7), paste0) %>%
    map(. %>%
      read_html() %>%
      html_table() %>%
      .[[1]] %>% # select first table
      dplyr::bind_rows() %>%
      janitor::clean_names() %>%
      slice(-1) %>%
      select(-21) %>%
      rename(nome = 1, nacionalidade = 2, posicao = 3, idade = 4, jogos = 5, inicios = 6,
             minutos = 7, minutos_90 = 8, gol = 9, assistencia = 10,
             gol_normal = 11, gol_penalti = 12, penalti_batido = 13, amarelo = 14, vermelho = 15,
             gol_90 = 16, assistencia_90 = 17,
             gol_assistencia_90 = 18, gol_normal_90 = 19, gol_assistencia_penalti_90 = 20) %>%
      as.data.frame() %>%
      format(scientific = FALSE) %>%
      mutate_at(., c(5:15), as.numeric) %>%
      mutate(nacionalidade = str_extract(nacionalidade, "[A-Z]+")) # only capital letters
    ) %>% # rename the columns
    setNames(paste0(rep("Gremio_", 7), seq(from = 2014, to = 2020))) # name the lists
}

#test with Grêmio
gremio <- funcao_extract("d5ae3703/", "/Gremio-Estatisticas")
Just a small example:
library(tidyverse)

example <- list(data.frame(a = 1:10), data.frame(a = 11:20))

example %>%
  map_df(~ {Sys.sleep(10); message(Sys.time()); .x} %>%
    summarise(a = sum(a)))
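If you prefer the purrr::slowly() wrapper mentioned in the question, a sketch of the same idea is to wrap read_html() so repeated calls are spaced roughly 10 seconds apart; the two fbref URLs below are only stand-ins for the ones built by the map2_chr() chain:

library(purrr)
library(rvest)

# slowly() returns a rate-limited copy of the function it wraps
read_html_slowly <- slowly(read_html, rate = rate_delay(10))

urls <- c("https://fbref.com/pt/equipes/d5ae3703/2014/Gremio-Estatisticas",
          "https://fbref.com/pt/equipes/d5ae3703/2020/Gremio-Estatisticas")

tables <- map(urls, ~ read_html_slowly(.x) %>% html_table() %>% .[[1]])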

Web Scraping: R rvest html_nodes return character(0)

Below is the code. The strange thing is that I could run it without problems yesterday, but today it always returns character(0). After checking through it, I found it is because of the html_nodes line.
I tried to replace ".photo-cards li article" with other nodes, and it is still not working.
Has anybody run into the same problem and fixed it? Thank you in advance for your help!
library(tidyverse)
library(rvest)

links <- sprintf("https://www.zillow.com/sacramento-ca/%d_p", 1:11)

results <- map(links, ~ {
  # selectors found with http://selectorgadget.com/
  # (<body class="photo-cards ...)
  houses <- read_html(.x) %>%
    html_nodes(".photo-cards li article")
  z_id <- houses %>%
    html_attr("id")
  address <- houses %>%
    html_node(".list-card-addr") %>%
    html_text()
  price <- houses %>%
    html_node(".list-card-price") %>%
    html_text() %>%
    readr::parse_number()
  params <- houses %>%
    html_node(".list-card-info") %>%
    html_text()
  # number of bedrooms
  beds <- params %>%
    str_extract("\\d+(?=\\s*bds)") %>%
    as.numeric()
  # number of bathrooms
  baths <- params %>%
    str_extract("\\d+(?=\\s*ba)") %>%
    as.numeric()
  # total square footage
  house_a <- params %>%
    str_extract("[0-9,]+(?=\\s*sqft)") %>%
    str_replace(",", "") %>%
    as.numeric()
  tibble(price = price, beds = beds, baths = baths, house_area = house_a)
}) %>%
  bind_rows(.id = 'page_no')
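One way to narrow this down is to look at what the server actually returns to a non-browser client, since an empty node set often means the site served a CAPTCHA or consent page instead of the listings. A small diagnostic sketch, assuming only httr and rvest:

library(httr)
library(rvest)

url <- "https://www.zillow.com/sacramento-ca/1_p"

# fetch the page while identifying as a regular browser
resp <- GET(url, user_agent("Mozilla/5.0"))
status_code(resp)  # anything other than 200 points to blocking

page <- read_html(content(resp, as = "text"))
page %>% html_node("title") %>% html_text()             # a captcha page usually has a telltale title
length(page %>% html_nodes(".photo-cards li article"))  # 0 reproduces the problem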

How to skip a page scrape when table is missing in R

I'm building a scraper that pulls the name of a player and the years he played, for thousands of different players. I have built an otherwise successful function to do this, but unfortunately in some instances the table with the other half of the data I need (years played) does not exist. For these instances, I'd like to add a way to tell the scraper to bypass them. Here is the code:
(note: the object "url_final" is the list of active webpage URLs, of which there are many)
library(rvest)
library(curl)
library(tidyverse)
library(httr)

df <- map_dfr(.x = url_final,
              .f = function(x) {
                Sys.sleep(.3); cat(1)
                fyr <- read_html(curl(x, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
                  html_table() %>%
                  .[[1]]
                fyr <- fyr %>%
                  select(1) %>%
                  mutate(name = str_extract(string = x, pattern = "(?<=cbb/players/).*?(?=-\\d\\.html)"))
              })
Here is an example of an active page; you can recreate the scrape by replacing "url_final" as the .x argument of map_dfr with:
https://www.sports-reference.com/cbb/players/karl-aaker-1.html
Here is an example of one of the instances in which there is no table, which returns an error that breaks the scraping loop:
https://www.sports-reference.com/cbb/players/karl-aaker-1.html
How about adding tryCatch, which will ignore any errors?
library(tidyverse)
library(rvest)

df <- map_dfr(.x = url_final,
              .f = function(x) {
                Sys.sleep(.3); cat(1)
                tryCatch({
                  fyr <- read_html(curl::curl(x,
                                              handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
                    html_table() %>%
                    .[[1]]
                  fyr <- fyr %>%
                    select(1) %>%
                    mutate(name = str_extract(string = x,
                                              pattern = "(?<=cbb/players/).*?(?=-\\d\\.html)"))
                }, error = function(e) message('Skipping url ', x))
              })
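If you would rather stay inside purrr, an equivalent sketch (same assumptions about url_final) wraps the scraping step in purrr::possibly(), which returns NULL instead of raising an error, so map_dfr() simply drops those pages:

library(tidyverse)
library(rvest)

scrape_player <- function(x) {
  Sys.sleep(.3)
  fyr <- read_html(curl::curl(x, handle = curl::new_handle("useragent" = "Mozilla/5.0"))) %>%
    html_table() %>%
    .[[1]]
  fyr %>%
    select(1) %>%
    mutate(name = str_extract(x, "(?<=cbb/players/).*?(?=-\\d\\.html)"))
}

safe_scrape <- possibly(scrape_player, otherwise = NULL)
df <- map_dfr(url_final, safe_scrape)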
