rvest help: web scraping returns empty results

I need help with my web scraping... can someone save me?
I am trying to get the list of universities on this web page: https://www.whed.net/results_institutions.php. For this purpose, I am using the following code:
library(rvest)
library(dplyr)

whed_afg <- "https://www.whed.net/results_institutions.php"
whed_afg1 <- read_html(whed_afg)
whed_afg1
str(whed_afg1)

# Select the institution links inside the results list and extract their text
univ_afg1 <- whed_afg1 %>% html_nodes("#results .fancybox\\.iframe") %>% html_text()
univ_afg1
I put the double backslash in the html_nodes() selector because a single backslash gave me this error: Error: '.' is an unrecognized escape in character string starting ""#results .fancybox."
Can someone help me? I do not know what I am doing wrong.
Thank you all,
Ricardo

I think perhaps you have the wrong start URL, or it is behind a login, as I get redirected with your URL. I can see a full university list on the following URL, with a different class for selecting. The results could then be split by country of interest, as sketched after the code below.
library(rvest)

url <- "https://www.iau-aiu.net/List-of-IAU-Members?lang=en"
# Each member institution is listed as an external link with class .spip_out
universities <- read_html(url) %>% html_nodes('.spip_out') %>% html_text()
print(universities)
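To split the list by country, one rough sketch is to walk from each link back to the nearest preceding heading. The h3 tag below is an assumption; inspect the page (with SelectorGadget or the browser dev tools) and substitute whatever node actually carries the country name:
library(rvest)
library(xml2)

page <- read_html("https://www.iau-aiu.net/List-of-IAU-Members?lang=en")
links <- page %>% html_nodes(".spip_out")

# For each university link, look up the nearest preceding heading, which
# (by assumption) holds the country name; swap "h3" for the real tag
country <- sapply(links, function(node)
  xml_text(xml_find_first(node, "preceding::h3[1]")))

members <- data.frame(country = country,
                      university = html_text(links),
                      stringsAsFactors = FALSE)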

Related

Web Scraping in R Timeout

I am doing a project where I need to download FAFSA completion data from this website: https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school
I am using rvest to scrape that data, but when I try to use read_html on the link, it never finishes reading and eventually I have to stop execution. I can read in other websites, so I'm not sure whether it is a website-specific issue or whether I'm doing something wrong. Here is my code so far:
library(rvest)
fafsa_link <- "https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school"
read_html(fafsa_link)
Any help would be greatly appreciated! Thank you!
A user-agent header is required. The download links are also given in a JSON file. You could regex out all the links (or indeed parse them out properly); or, as I do here, regex out one link and then substitute the state code within it to get each additional download URL (given the URLs only vary in that one aspect). A variant that captures every link in one pass is sketched after the code.
library(magrittr)
library(httr)
library(stringr)

# The JSON endpoint refuses requests without a user-agent header
data <- httr::GET('https://studentaid.gov/data-center/student/application-volume/fafsa-completion-high-school.json',
                  add_headers("User-Agent" = "Mozilla/5.0")) %>%
  content(as = "text")

# Regex out the California link, then swap the state code to build the others
ca <- data %>% stringr::str_match(': "(.*?CA\\.xls)"') %>% .[2] %>% paste0('https://studentaid.gov', .)
ma <- gsub('CA\\.xls', 'MA.xls', ca)
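A minimal variant, assuming every state file follows the same quoted .xls pattern in the JSON text, captures all of the links at once:
# Capture every quoted .xls path in the JSON text in one pass
all_links <- data %>%
  stringr::str_match_all(': "(.*?\\.xls)"') %>%
  .[[1]] %>%
  .[, 2] %>%
  paste0('https://studentaid.gov', .)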

Rvest and xpath return misleading information

I am struggling with some scraping issues, using rvest and xpath.
The objective is to scrape the following page
https://www.barchart.com/futures/quotes/BT*0/futures-prices
and to extract the names of the futures
BTF21
BTG21
BTH21
etc for the full list of names.
The XPath for those variables seems to be xpath='//a'.
The following code returns no information of relevance, hence my question:
library(rvest)
url <- 'https://www.barchart.com/futures/quotes/BT*0'
valuation_col <- url %>%
  read_html() %>%
  html_nodes(xpath = '//a')
value <- valuation_col %>% html_text()
Any hint to proceed further to get the information would be much needed. Thanks in advance!

Rvest is unable to find the node specified by css selector, how do I fix it?

I am scraping data from this website, and for some reason I'm unable to get the name of the seller, even though I use the exact node returned by SelectorGadget. I have, however, managed to get all the other data with rvest.
I managed to scrape the seller's name with RSelenium, but that takes too much time. Anyway, here's the link to the page I'm scraping:
https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946
Here's the code I've used
SellerName <-
  read_html("https://kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946") %>%
  html_nodes(".link-4200870613") %>%
  html_text()
You can easily regex the seller name out of the returned HTML, as it is contained in a script tag (presumably it is rendered into the visible page only when the browser runs JavaScript, which rvest does not do).
library(rvest)
library(magrittr)
library(stringr)

# The seller name sits in a script tag, so take the raw page text and
# capture the value of the "sellerName" key
p <- read_html('https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946') %>% html_text()
seller_name <- str_match_all(p, '"sellerName":"(.*?)"')[[1]][, 2][1]
print(seller_name)
Regex: "sellerName":"(.*?)", where the lazy capture group takes everything between the quotes after the sellerName key.
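A slightly tighter sketch, on the same assumption that the value sits in a script tag, searches only the script nodes rather than the whole page text:
# Restrict the regex to the contents of the <script> tags
scripts <- read_html('https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946') %>%
  html_nodes('script') %>%
  html_text()
seller_name <- str_match(paste(scripts, collapse = ' '), '"sellerName":"(.*?)"')[, 2]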

Scraping dynamic information in R

I'm trying to use an XPath to scrape two figures I need from this website: http://www.myfxbook.com/community/outlook/EURUSD.
So far I'm having no luck. Any help appreciated.
Does it need to be XPath? You can get them with a CSS selector:
library(rvest)

page <- read_html("http://www.myfxbook.com/community/outlook/EURUSD")
# The figures sit in the fourth cell of each row of the left-hand column table
page %>% html_nodes("#leftColumn td:nth-child(4)") %>% html_text()
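If the cells come back as formatted strings (for example with % signs, which is an assumption about this page), a small follow-up converts them to numbers:
vals <- page %>% html_nodes("#leftColumn td:nth-child(4)") %>% html_text()
# Strip everything except digits, decimal points and minus signs
as.numeric(gsub("[^0-9.-]", "", vals))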

Applying rvest pipes to a dataframe

I've got a dataframe called base_table with a lot of 311 data and URLs that point to a broader description of each call.
I'm trying to create a new variable called case_desc by applying a series of rvest functions to each URL.
base_table$case_desc <-
  read_html(base_table$case_url) %>%
  html_nodes("rc_descrlong") %>%
  html_text()
But this doesn't work, for what I suppose are obvious reasons that I can't work out right now. I've tried playing around with functions, but can't seem to nail the right format.
Any help would be awesome! Thank you!
It doesn't work because read_html doesn't work with a vector of URLs. It will throw an error if you give it a vector...
> read_html(c("http://www.google.com", "http://www.yahoo.com"))
Error: expecting a single value
You probably have to use an apply function...
library("rvest")
base_table$case_desc <- sapply(base_table$case_url, function(x)
read_html(x) %>%
html_nodes("rc_descrlong") %>%
html_text())
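A minimal alternative with vapply, which fails loudly if a page does not yield exactly one string (the paste(collapse) guard is an addition here, in case a page matches several nodes):
base_table$case_desc <- vapply(base_table$case_url, function(x)
  read_html(x) %>%
    html_nodes("rc_descrlong") %>%
    html_text() %>%
    paste(collapse = " "),
  character(1))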
