I'm trying to scrape two things. First, I want to extract the links to each individual school on this page:

library(rvest)
library(dplyr)

scraped_links <- read_html("https://www.scholenopdekaart.nl/middelbare-scholen/zoeken/") %>%
  html_nodes("a.school-naam") %>%
  html_attr("href") %>%   # href attributes come back as a character vector
  as.data.frame() %>%
  as_tibble()
Then I want to scrape the tables on pages like this one:
scraped_tables <- read_html("https://www.scholenopdekaart.nl/Middelbare-scholen/146/1086/Almere-College/Slaagpercentage") %>%
  html_nodes(xpath = "/html/body/div[3]/div[3]/div[1]/div[3]/div[3]/div[3]") %>%
  html_table() %>%
  as.data.frame() %>%
  as_tibble()
Both return empty data frames. I've tried CSS selectors and multiple XPaths, but I can't get it to work. I hope someone can help me.
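One likely cause of both pipelines coming back empty is that the school list is rendered client-side, so the static HTML that read_html() downloads contains no a.school-naam nodes at all. A quick diagnostic sketch; the read_html_live() fallback assumes rvest >= 1.0.4, which drives a headless browser via chromote:

library(rvest)

page <- read_html("https://www.scholenopdekaart.nl/middelbare-scholen/zoeken/")
length(html_nodes(page, "a.school-naam"))   # 0 here strongly suggests JS-rendered content

# If the count is 0, a live browser session sees the rendered DOM instead:
# page <- read_html_live("https://www.scholenopdekaart.nl/middelbare-scholen/zoeken/")
# page %>% html_elements("a.school-naam") %>% html_attr("href")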
I am having a problem when trying to scrape some data. I have written a function that works properly on its own; the problem occurs when I run it over many different tickers.
library(rvest)
library(dplyr)

getFin <- function(ticker) {
  # Build the key-statistics URL for one ticker and parse the page.
  url <- paste0("https://it.finance.yahoo.com/quote/", ticker,
                "/key-statistics?p=", ticker)
  a <- read_html(url)
  tbl <- a %>% html_nodes("section") %>% html_nodes("div") %>% html_nodes("table")
  # The statistics tables on the page, in document order.
  misureval     <- tbl %>% .[1]  %>% html_table() %>% as.data.frame()
  prezzistorici <- tbl %>% .[2]  %>% html_table() %>% as.data.frame()
  titolistat    <- tbl %>% .[3]  %>% html_table() %>% as.data.frame()
  dividendi     <- tbl %>% .[4]  %>% html_table() %>% as.data.frame()
  annofiscale   <- tbl %>% .[5]  %>% html_table() %>% as.data.frame()
  redditivita   <- tbl %>% .[6]  %>% html_table() %>% as.data.frame()
  gestione      <- tbl %>% .[7]  %>% html_table() %>% as.data.frame()
  contoeco      <- tbl %>% .[8]  %>% html_table() %>% as.data.frame()
  bilancio      <- tbl %>% .[9]  %>% html_table() %>% as.data.frame()
  flussi        <- tbl %>% .[10] %>% html_table() %>% as.data.frame()
  info1 <- rbind(ticker, misureval, prezzistorici, titolistat, dividendi,
                 annofiscale, redditivita, gestione, contoeco, bilancio, flussi)
  return(info1)
}
What I am trying to do is call

finale <- lapply(codici, getFin)

where codici is a vector of tickers, each one used to build a URL and scrape its data.
I have tried it with 50 tickers and the function works properly; however, when I increase the number I get this error:

Error in xml_nodeset(NextMethod()) : Expecting an external pointer:
[type=NULL].

I don't know whether this is related to the number of requests or to something else. I have also tested a non-existent ticker and the function still works; the problem only arises when the number of tickers is large.
Problem solved: I just needed to add Sys.sleep() to reduce the frequency of requests. The best value in this case is 3 seconds, so Sys.sleep(3) at the end of each iteration.
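For reference, a minimal sketch of the throttled call, assuming codici is a character vector of tickers:

finale <- lapply(codici, function(ticker) {
  res <- getFin(ticker)
  Sys.sleep(3)  # pause between requests so the server does not drop the connection
  res
})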
I am trying to extract some info from this sports-betting webpage using rvest. I asked a related question a few days ago and got almost all the way to my goal. So far, thanks to that answer, I have successfully extracted the title, the score, and the time of the matches being played using this code:
library(rvest)
library(tidyverse)
page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>%
read_html()
data=data.frame(
Titulo = page %>%
html_elements(".titulo") %>%
html_text(),
Marcador = page %>%
html_elements(".marcador") %>%
html_text(),
Tiempo = page %>%
html_elements(".marcador+ span") %>%
html_text() %>%
str_squish()
)
Now I want to capture the values that repeat across matches. For example, if the country of a match is "Brasil", I want the data frame to record Brasil as the country for every match in that category. So far I have only managed to extract all the countries individually; the same applies to the sport name and the tournament. Can you help me with that? Thanks in advance.
You could rewrite your code to use separate functions that work with different levels of information; these can be called in a nested fashion, making the code easier to read. Essentially, nested map_dfr() calls produce a single data frame from functions working with lists at different levels of the DOM.
Below, you can think of it as an outer loop over sports, an intermediate loop over countries, and an innermost loop over the events within a sport and country.
library(rvest)
library(tidyverse)
get_sport_info <- function(sport) {
df <- map_dfr(sport %>% html_elements(".category"), get_play_info)
df$sport <- sport %>%
html_element(".sport-name") %>%
html_text()
return(df)
}
get_play_info <- function(play) {
df <- map_dfr(play %>% html_elements(".event"), ~
data.frame(
titulo = .x %>% html_element(".titulo") %>% html_text(),
marcador = .x %>% html_element(".marcador") %>% html_text(),
tiempo = .x %>% html_element(".marcador + span") %>% html_text() %>% str_squish()
))
df$country <- play %>%
html_element(".category-name") %>%
html_text()
return(df)
}
page <- "https://www.supermatch.com.uy/live_recargar_menu/" %>% read_html()
sports <- page %>% html_elements(".sport")
final <- map_dfr(sports, get_sport_info)
I'm trying to scrape the various tables from this webpage: https://www.pro-football-reference.com/years/2020/
When inspecting the elements of the page, I found it easy to obtain the first two tables by using the following code:
### packages
library(tidyverse)
library(rvest)
### Scrape offense
url_off <- read_html("https://www.pro-football-reference.com/years/2020/")
## AFC Standings
url_off %>%
  html_table(fill = TRUE) %>%
  .[1] %>%
  as.data.frame()
## NFC Standings
url_off %>%
  html_table(fill = TRUE) %>%
  .[2] %>%
  as.data.frame()
Where I am stuck is every other table on that page. Take the offense table: I can see where it sits when inspecting the page, but I've tried a few ways of extracting it without any luck. For example:
url_off %>%
  html_nodes(".table_outer_container") %>%
  html_nodes("#team_stats")

url_off %>%
  html_nodes(".table_wrapper") %>%
  html_nodes("#team_stats")
The same happens when I try to extract any of the other tables from that page; the only two I can get are the first two (above). I can't figure out where I am going wrong.
I've sorted it out. The data for the other tables is stored inside HTML comments, which is why html_table() couldn't see it. Here is how I've extracted the tables, for anyone interested or having similar issues:
url_off %>%
  html_nodes('#all_team_stats') %>%     # wrapper div around the hidden table
  html_nodes(xpath = 'comment()') %>%   # select the comment node inside it
  html_text() %>%                       # the comment's text is raw HTML
  read_html() %>%                       # re-parse that HTML
  html_node('table') %>%
  html_table()
url_off %>%
  html_nodes('#all_passing') %>%
  html_nodes(xpath = 'comment()') %>%
  html_text() %>%
  read_html() %>%
  html_node('table') %>%
  html_table()
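Since every hidden table on the site follows the same wrapper-id pattern, the two pipelines above generalize into a small helper. A sketch; the wrapper ids ('#all_team_stats', '#all_passing', ...) are read off the page source:

# Extract a table that the site ships inside an HTML comment.
get_commented_table <- function(page, wrapper_id) {
  page %>%
    html_nodes(wrapper_id) %>%
    html_nodes(xpath = 'comment()') %>%
    html_text() %>%
    read_html() %>%
    html_node('table') %>%
    html_table()
}

team_stats <- get_commented_table(url_off, '#all_team_stats')
passing    <- get_commented_table(url_off, '#all_passing')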
I would like to scrape only the candidate names from these tables, plus the votes reported in the third column (after the image and the candidate name).
This is as far as I've gotten.
library(rvest)

ndp_leadership <- url('https://en.wikipedia.org/wiki/New_Democratic_Party_leadership_elections')
results <- read_html(ndp_leadership)
results <- html_nodes(results, 'table')
out <- results %>%
  html_nodes(xpath = "//*[contains(., 'Candidate')]//tr/td")
out
Although this doesn't really use XPath, here's one way to do it:
library(purrr)  # for map()

results <- read_html(ndp_leadership) %>%
  html_nodes(".wikitable") %>%
  html_table(fill = TRUE) %>%
  map(~ .[, 2]) %>%   # keep the second (candidate) column of each table
  unlist() %>%
  setdiff(., c("Candidate", "Total"))
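The question also asked for the vote counts in the third column. A sketch extending the same approach, assuming every .wikitable has at least three columns with the votes in the third, and coercing both columns to character so the tables row-bind cleanly:

library(dplyr)

candidates_votes <- read_html(ndp_leadership) %>%
  html_nodes(".wikitable") %>%
  html_table(fill = TRUE) %>%
  map_dfr(~ data.frame(candidate = as.character(.x[[2]]),
                       votes     = as.character(.x[[3]]),
                       stringsAsFactors = FALSE)) %>%
  filter(!candidate %in% c("Candidate", "Total"))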
I am working on a project that requires me to go through various pages of links, and within those links find the XML file and parse it. I am having trouble extracting the XML file. There are two XML files within each link, and I am interested in the bigger one. How can I extract the XML files and find the one with the maximum size? I tried using the grep() function, but it keeps giving me an error.
sotu <- data.frame()
for (i in seq(1, 501, 100)) {
  securl <- paste0("https://www.sec.gov/cgi-bin/srch-edgar?text=abs-ee&start=",
                   i, "&count=100&first=2016")
  main.page <- read_html(securl)
  urls <- main.page %>%
    html_nodes("div td:nth-child(2) a") %>%
    html_attr("href")
  baseurl <- "https://www.sec.gov"
  fulllink <- paste(baseurl, urls, sep = "")
  names <- main.page %>%
    html_nodes("div td:nth-child(2) a") %>%
    html_text()
  date <- main.page %>%
    html_nodes("td:nth-child(5)") %>%
    html_text()
  result <- data.frame(urls = fulllink, companyname = names,
                       FilingDate = date, stringsAsFactors = FALSE)
  sotu <- rbind(sotu, result)
}
for (i in seq(nrow(sotu))) {
  getXML <- read_html(sotu$urls[1]) %>%
    grep("xml", getXML, ignore.case = FALSE)
}
Everything works except the last part: when I try to loop over every link and find the XML file, I keep getting an error. Is grep() not the right function?
With some help from dplyr, we can do:
sotu %>%
  rowwise() %>%
  do({
    read_html(.$urls) %>%
      html_table() %>%
      as.data.frame() %>%
      filter(grepl('.*\\.xml', Document)) %>%
      filter(Size == max(Size))
  })
Or, since the Type is always 'EX-102', at least in the example:

sotu %>%
  rowwise() %>%
  do({
    read_html(.$urls) %>%
      html_table() %>%
      as.data.frame() %>%
      filter(Type == 'EX-102')
  })
This also gets rid of the for loop; explicit loops like that are rarely the best tool in R.
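For what it's worth, do() has since been superseded in dplyr, and the same idea reads naturally with purrr::map_dfr(). A sketch, assuming each filing page yields a table with Document and Size columns as above:

library(rvest)
library(dplyr)
library(purrr)

xml_docs <- map_dfr(sotu$urls, function(u) {
  read_html(u) %>%
    html_table() %>%
    as.data.frame() %>%
    filter(grepl('\\.xml$', Document)) %>%  # keep only the XML documents
    filter(Size == max(Size))               # and the bigger of the two
})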