Rvest and xpath return misleading information

I am struggling with some scraping issues, using rvest and xpath.
The objective is to scrape the following page
https://www.barchart.com/futures/quotes/BT*0/futures-prices
and to extract the names of the futures
BTF21
BTG21
BTH21
etc., for the full list of names.
The XPath for those names seems to be xpath='//a'.
The following code returns nothing of relevance, hence my question:
library(rvest)
url <- 'https://www.barchart.com/futures/quotes/BT*0'
valuation_col <- url %>%
  read_html() %>%
  html_nodes(xpath = '//a')
value <- valuation_col %>% html_text()
Any hint on how to proceed to extract this information would be much appreciated. Thanks in advance!
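If the futures table is built by JavaScript after the page loads, the static HTML that read_html() sees may contain no contract links at all, which would explain the empty result (rvest does not execute JavaScript). A minimal check along those lines, reusing the question's own URL and selector, might look like this; whether barchart.com actually works this way is an assumption here, not a verified fact:
library(rvest)
url <- 'https://www.barchart.com/futures/quotes/BT*0/futures-prices'
links <- read_html(url) %>% html_nodes(xpath = '//a')
# How many anchors does the static HTML contain, and do any hrefs mention a BT contract?
length(links)
grep("BT", html_attr(links, "href"), value = TRUE)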

Related

rvest help: webscraping empty results

I need help with my web scraping... can someone save me?
I am trying to get the list of universities from this webpage: https://www.whed.net/results_institutions.php. For this purpose, I am using the following code:
library(rvest)
library(dplyr)
whed_afg <- "https://www.whed.net/results_institutions.php"
whed_afg1 <- read_html(whed_afg)
whed_afg1
str(whed_afg1)
univ_afg1 = whed_afg1 %>% html_nodes("#results .fancybox\\.iframe") %>% html_text()
univ_afg1
I used a double backslash in html_nodes() because a single backslash gave the error: Error: '.' is an unrecognized escape in character string starting ""#results .fancybox."
Can someone help me? I do not know what I am doing wrong.
Thank you all,
Ricardo
I think perhaps you have the wrong start URL, or it is behind a login, as I get redirected when I use your URL. I see a full university list at the following URL, selected with a different class. The results could then be split by country of interest.
library(rvest)
url <- "https://www.iau-aiu.net/List-of-IAU-Members?lang=en"
universities <- read_html(url) %>% html_nodes('.spip_out') %>% html_text()
print(universities)
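As a quick sanity check before splitting by country (a sketch only; the actual counts are not verified here), you could look at how many names came back and what the first few entries are:
length(universities)  # number of member institutions the .spip_out selector matched
head(universities)    # first few names, to confirm real entries were captured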

How to scrape NBA data?

I want to compare rookies across leagues with stats like points per game (PPG) and such. ESPN and NBA have great tables to scrape from (as does Basketball-Reference), but I just found out that they're not stored in the HTML, so I can't use rvest. For context, I'm trying to scrape tables like this one (from NBA):
https://i.stack.imgur.com/SdKjE.png
I'm trying to learn how to use httr and JSON for this, but I'm running into some issues. I followed the answer in this post, but it's not working out for me.
This is what I've tried:
library(httr)
library(jsonlite)
library(magrittr)  # for the %>% pipe
coby.white <- GET('https://www.nba.com/players/coby/white/1629632')
out <- content(coby.white, as = "text") %>%
  fromJSON(flatten = FALSE)
However, I get an error:
Error: lexical error: invalid char in json text.
<!DOCTYPE html><html class="" l
(right here) ------^
Is there an easier way to scrape a table from ESPN or NBA, or is there a solution to this issue?
PPG and other stats come from
https://data.nba.net/prod/v1/2019/players/1629632_profile.json
and player info, e.g. weight and height, from
https://www.nba.com/players/active_players.json
So, you could use jsonlite to parse, e.g.
library(jsonlite)
data <- jsonlite::read_json('https://data.nba.net/prod/v1/2019/players/1629632_profile.json')
You can find these endpoints in the browser's network tab when refreshing the page. It looks like you can substitute the player id in the URL to get different players' info for the season.
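Building on that observation, a minimal sketch of fetching several players in one lapply() call could look like the following; the two IDs besides 1629632 (Coby White) are placeholders, and the exact structure of the returned JSON is not verified here:
library(jsonlite)
# Hypothetical example IDs; only 1629632 appears in the answer above.
player_ids <- c("1629632", "1629630", "1629627")
profiles <- lapply(player_ids, function(id) {
  url <- sprintf("https://data.nba.net/prod/v1/2019/players/%s_profile.json", id)
  jsonlite::read_json(url)
})
# Inspect one profile's top-level structure to locate the stats of interest.
str(profiles[[1]], max.level = 2)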
You actually can web scrape this with rvest; here's an example of scraping White's Totals table from Basketball Reference. Anything on Sports Reference's sites that is not the first table on the page is stored inside an HTML comment, meaning we must extract the comment nodes first and then pull the desired data table out of them.
library(rvest)
library(dplyr)
cobywhite <- 'https://www.basketball-reference.com/players/w/whiteco01.html'
totalsdf <- cobywhite %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%
  html_text() %>%
  paste(collapse = '') %>%
  read_html() %>%
  html_node("#totals") %>%
  html_table()
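Since the original question was about per-game numbers, a rough follow-up (assuming the scraped Totals table keeps Basketball-Reference's Season, G and PTS column names, which is not guaranteed) might be:
library(dplyr)
ppg_by_season <- totalsdf %>%
  mutate(PPG = PTS / G) %>%      # points per game derived from season totals
  select(Season, G, PTS, PPG)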

Rvest is unable to find the node specified by css selector, how do I fix it?

I am scraping data from this website, and for some reason I'm unable to get the name of the seller, even though I use the exact node returned by SelectorGadget. I have, however, managed to get all the other data with rvest.
I managed to scrape the seller's name with RSelenium, but that takes too much time. Anyway, here's the link to the page I'm scraping:
https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946
Here's the code I've used:
SellerName <- read_html("https://kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946") %>%
  html_nodes(".link-4200870613") %>%
  html_text()
You can regex the seller name out of the returned page easily, as it is contained in a script tag (presumably loaded by JavaScript when the browser runs it, which rvest does not).
library(rvest)
library(magrittr)
library(stringr)
p <- read_html('https://www.kijiji.ca/v-fitness-personal-trainer/bedford/swimming-lessons/1421292946') %>%
  html_text()
seller_name <- str_match_all(p, '"sellerName":"(.*?)"')[[1]][, 2][1]
print(seller_name)
Regex: "sellerName":"(.*?)"
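An equivalent, slightly shorter extraction (a sketch using stringr's str_match(), which keeps only the first match per string) would be:
seller_name <- str_match(p, '"sellerName":"(.*?)"')[, 2]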

R programming Web Scraping

I tried to scrape a webpage from the link below using the rvest package in R.
The link that I scraped is http://dk.farnell.com/c/office-computer-networking-products/prl/results
My code is:
library("xml2")
library("rvest")
url<-read_html("http://dk.farnell.com/c/office-computer-networking-products/prl/results")
tbls_ls <- url %>%
html_nodes("table") %>%
html_table(fill = TRUE)%>%
gsub("^\\s\\n\\t+|\\s+$n+$t+$", "", .)
View(tbls_ls)
My requirement is to remove \n and \t from the result. I also want to add pagination, so that I can scrape multiple pages of this site.
I'm intrigued by these kinds of questions so I'll try to help you out. Be forewarned, I am not an expert with this stuff (or anything close to it). Anyway, I think it should be kind of like this...
library(rvest)
library(tidyverse)
urls <- "http://dk.farnell.com/c/office-computer-networking-products/prl/results/"
pag <- 1:5
# Build one URL per results page, then parse each page's HTML.
read_urls <- paste0(urls, pag)
read_urls %>%
  map(read_html) -> p
Now, I didn't see any '\\n' or '\\t' patterns in the data sets. Nevertheless, if you want to look for a specific string, you can do it like this.
library(stringr)
str_which(html_text(p[[1]]), "your_string_here")
The link below is very useful!
http://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/webscrape.html
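Coming back to the original requirement of stripping \n and \t from the scraped tables, a minimal sketch (assuming each page's tables parse cleanly into data frames; the coercion of every column to character is a side effect of gsub()) could be:
library(rvest)
library(dplyr)
library(purrr)
# Strip newlines/tabs and trim whitespace in every cell (values become character).
clean_df <- function(df) {
  mutate(df, across(everything(), ~ trimws(gsub("[\n\t]+", " ", .x))))
}
# Re-using the list of parsed pages `p` from above: pull every <table> on each
# page, convert to data frames, and clean them.
clean_tables <- map(p, function(page) {
  page %>%
    html_nodes("table") %>%
    html_table(fill = TRUE) %>%
    map(clean_df)
})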

Difference between GET(), read_html(), getURL() in R

I'm attempting to scrape realtor.com for a school project. I have a working solution, which entails using a combination of the rvest and httr packages, but I want to migrate it to the RCurl package, specifically using the getURLAsynchronous() function. I know that my algorithm will scrape much faster if I can download multiple URLs at once. Here's what I have so far:
library(RCurl)
library(rvest)
library(httr)
urls <- c("http://www.realtor.com/realestateandhomes-search/Denver_CO/pg-1?pgsz=50",
          "http://www.realtor.com/realestateandhomes-search/Denver_CO/pg-2?pgsz=50")
prop.info <- vector("list", length = 0)
for (j in 1:length(urls)) {
  prop.info <- c(prop.info,
                 urls[[j]] %>%                            # recursively builds the list using each url
                   GET(add_headers("user-agent" = "r")) %>%
                   read_html() %>%                        # creates the html object
                   html_nodes(".srp-item-body") %>%       # grabs the appropriate html element
                   html_text())                           # converts it to a text vector
}
This gets me an output that I can readily work with: I'm getting all of the information off of the webpages by reading the HTML from the GET() output, then finding the HTML nodes and converting them to text. The trouble I'm running into comes when I attempt to implement something similar using RCurl.
Here is what I have for that using the same URLs:
getURLAsynchronous(urls) %>%
read_html() %>%
html_node(".srp-item-details") %>%
html_text
When I call getURLAsynchronous() on the urls vector, not all of the information is downloaded. I'm honestly not sure exactly what is being scraped. However, I know it's considerably different from my current solution.
Any ideas what I'm doing wrong? Or maybe an explanation on how getURLAsynchronous() should be working?
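One thing worth noting (as an assumption about the behaviour, not a verified diagnosis of the realtor.com results) is that getURLAsynchronous() returns a character vector with one raw HTML string per URL, while read_html() parses a single document, so each element needs to be parsed separately before the rvest selectors are applied. A sketch reusing the question's own selector:
library(RCurl)
library(rvest)
library(purrr)
# getURLAsynchronous() returns one raw-HTML string per URL; parse each one
# on its own before selecting nodes with rvest.
pages <- getURLAsynchronous(urls)
prop.info <- map(pages, function(html_string) {
  read_html(html_string) %>%
    html_nodes(".srp-item-body") %>%
    html_text()
})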
