rvest only returns header from scraped table - r

The following code only returns the headers from the table I am trying to scrape with rvest.
library(rvest)
url <-("https://www.baseball-reference.com/draft/?year_ID=2017&draft_round=1&draft_type=junreg&query_type=year_round&from_type_jc=0&from_type_hs=0&from_type_4y=0&from_type_unk=0")
draft <- read_html(url)
draft_first_html <- html_nodes(draft,xpath = '//*[#id="div_draft_stats"]')
I've tried a few different xpaths with no luck. It should return 36 observations and 24 variables.

This works for me after correcting the id in your selector:
draft <- read_html(url)
draft %>%
  html_node("#draft_stats") %>%
  html_table()

You were close to the answer. You just needed to correct the id to get the proper html node. Then using html_table() on that node will give you the data you want. My try at the solution:
library(rvest)
url <-("https://www.baseball-reference.com/draft/?year_ID=2017&draft_round=1&draft_type=junreg&query_type=year_round&from_type_jc=0&from_type_hs=0&from_type_4y=0&from_type_unk=0")
draft <- read_html(url)
draft_first_html <- html_node(draft,xpath = '//*[#id="draft_stats"]')
draft_df <- html_table(draft_first_html)
A cleaner solution with less code would be:
library(rvest)
url <-("https://www.baseball-reference.com/draft/?year_ID=2017&draft_round=1&draft_type=junreg&query_type=year_round&from_type_jc=0&from_type_hs=0&from_type_4y=0&from_type_unk=0")
draft_df <- read_html(url) %>%
html_node(xpath = '//*[#id="draft_stats"]') %>%
html_table()
Hope it helped! I didn't check the terms and conditions of the webpage, but always be sure that you are respecting the terms before scraping :)
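If you also want a quick programmatic sanity check on top of reading the terms, the robotstxt package (not used in the answer above, just a suggestion) can tell you whether a given path is allowed for crawling:
library(robotstxt)

# Check whether the draft pages are allowed by the site's robots.txt
paths_allowed(
  paths  = "/draft/",
  domain = "www.baseball-reference.com"
)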
If there is anything that you don't understand about my solution, don't hesitate to leave a comment below!

Related

Issue webscraping a website: not extracting anything

I am trying to extract data from the following website: 'https://2010-2014.kormany.hu/hu/hirek'. When I try to extract, for example, the articles' links from that website using the following code, I get nothing.
library(rvest)
library(dplyr)
library(XML)
url <- 'https://2015-2019.kormany.hu/hu/hirek'
links <- read_html(url) %>%
  html_nodes("div") %>%
  html_nodes(xpath = '//*[@class="article"]') %>%
  html_nodes("h2") %>%
  html_nodes("a") %>%
  html_attr("href")
links
> character(0)
I don't even get anything if I run the following code:
links <- read_html(url) %>% html_nodes("div")
links
> character(0)
This is very strange since, when I inspect the website, it seems that I should be getting the list of URLs from the code I provided. According to the website's source, there are "div" nodes ('view-source:https://2015-2019.kormany.hu/hu/hirek'). Does anyone know what I could be doing wrong?
Today I re-tried my code and it works perfectly. I am not sure what was happening yesterday.
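For what it's worth, when a scrape returns nothing one day and works the next, it can help to look at the raw HTTP response before parsing. A minimal sketch with httr (not part of the original code) to check what the server actually returned:
library(httr)

resp <- GET("https://2015-2019.kormany.hu/hu/hirek")

# A non-200 status or an unexpected content type would explain empty node sets
status_code(resp)
headers(resp)[["content-type"]]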

Rvest and xpath returns misleading information

I am struggling with some scraping issues, using rvest and xpath.
The objective is to scrape the following page
https://www.barchart.com/futures/quotes/BT*0/futures-prices
and to extract the names of the futures
BTF21
BTG21
BTH21
and so on, for the full list of names.
The xpath for those elements seems to be xpath='//a'.
The following code returns nothing of relevance, hence my question:
library(rvest)
url <- 'https://www.barchart.com/futures/quotes/BT*0'
valuation_col <- url %>%
  read_html() %>%
  html_nodes(xpath = '//a')
value <- valuation_col %>% html_text()
Any hint on how to proceed to get this information would be much appreciated. Thanks in advance!

Using R to scrape play-by-play data

I am currently trying to scrape the play-by-play entries from the following link:
https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4
I used SelectorGadget to determine CSS selectors and ended up with '//td'. However, when I attempt to scrape the data using this, html_node() returns an empty result and thus the following code returns an error.
library("rvest")
url <- "https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4"
play_by_play <- url %>%
  read_html() %>%
  html_node(xpath = '//td') %>%
  html_table()
play_by_play
Does anybody know how to resolve this issue?
Thank you in advance!
I think you cannot get the table simply because there is no table in the website (see the source).
If there were any tables, you could get them with the following code.
library("rvest")
url <- "https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4"
play_by_play <- url %>%
  read_html() %>%
  html_table()
play_by_play
The data in the page you are loading is rendered with JavaScript, so when you use read_html() you are not seeing what you want. If you check the page source ("view source"), you will not find any table or td elements there.
What you can do is use another option such as RSelenium to get the rendered page source; if you want to use rvest afterwards, you can scrape from the source you get.
library(rvest)
library(RSelenium)

url <- "https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4"

# Start a Selenium driver and load the page so the JavaScript can run
rD <- rsDriver()
remDr <- rD$client
remDr$navigate(url)

# Grab the rendered page source and hand it to rvest
page_source <- remDr$getPageSource()[[1]]
play_by_play <- read_html(page_source, encoding = "UTF-8") %>%
  html_nodes("td")

remDr$close()
rm(remDr, rD)
gc()
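A short follow-up sketch, reusing the same page_source object from above: instead of grabbing individual td nodes, you can parse every rendered table straight into data frames.
# Parse every rendered <table> on the page into a list of data frames
tables <- read_html(page_source, encoding = "UTF-8") %>%
  html_table(fill = TRUE)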

Get links while web scraping Google in R

I am trying to get the links that Google returns when I do a search, that is, all of the result links.
I have done this kind of scraping before, but in this case I do not understand why it doesn't work. I run the following lines:
library(rvest)
url<-"https://www.google.es/search?q=Ediciones+Peña+sl+telefono"
content_request<-read_html(url)
content_request %>%
html_nodes(".r") %>%
html_attr("href")
I have tried with other nodes and I obtain similar results:
content_request %>%
  html_nodes(".LC20lb") %>%
  html_attr("href")
Finally, I tried to get all the links on the web page, but there are some links that I cannot retrieve:
html_attr(html_nodes(content_request, "a"), "href")
Please, could you help me in this case? Thank you.
Here are two options for you to play around with.
#1)
url <- "https://www.google.es/search?q=Ediciones+Pe%C3%B1a+sl+telefono"
html <- paste(readLines(url), collapse="\n")
library(stringr)
matched <- str_match_all(html, "<a href=\"(.*?)\"")
#2)
library(xml2)
library(rvest)
URL <- "https://www.google.es/search?q=Ediciones+Pe%C3%B1a+sl+telefono"
pg <- read_html(URL)
head(html_attr(html_nodes(pg, "a"), "href"))
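Note that on the plain-HTML version of a Google results page, the organic result links usually appear as relative hrefs of the form "/url?q=..." (this is an assumption about Google's markup, which changes frequently); if that holds, you can filter and clean them roughly like this:
library(xml2)
library(rvest)

pg <- read_html("https://www.google.es/search?q=Ediciones+Pe%C3%B1a+sl+telefono")
hrefs <- html_attr(html_nodes(pg, "a"), "href")

# Keep only hrefs that look like wrapped result links, then strip the wrapper
result_links <- grep("^/url\\?q=", hrefs, value = TRUE)
result_links <- sub("^/url\\?q=", "", result_links)
result_links <- sub("&.*$", "", result_links)  # roughly drop Google's tracking parameters
head(result_links)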

R programming Web Scraping

I tried to scrape the webpage at the link below using the rvest package in R.
The link that I scraped is http://dk.farnell.com/c/office-computer-networking-products/prl/results
My code is:
library("xml2")
library("rvest")
url <- read_html("http://dk.farnell.com/c/office-computer-networking-products/prl/results")
tbls_ls <- url %>%
  html_nodes("table") %>%
  html_table(fill = TRUE) %>%
  gsub("^\\s\\n\\t+|\\s+$n+$t+$", "", .)
View(tbls_ls)
My requirement is to remove the \\n and \\t characters from the result. I also want to handle pagination so that I can scrape multiple pages of this listing.
I'm intrigued by these kinds of questions so I'll try to help you out. Be forewarned, I am not an expert with this stuff (or anything close to it). Anyway, I think it should be kind of like this...
library(rvest)
library(tidyverse)

# Build one URL per results page and read each of them
# (this assumes the pages live at .../results/1, .../results/2, and so on;
#  adjust the paste0() call to the site's real pagination scheme)
base_url <- "http://dk.farnell.com/c/office-computer-networking-products/prl/results/"
pag <- 1:5
read_urls <- paste0(base_url, pag)

p <- read_urls %>%
  map(read_html)
Now, I didn't see any '\\n' or '\\t' patterns in the data sets. Nevertheless, if you want to look for a specific string, you can do it like this.
library(stringr)
# e.g. search the text of the first scraped page for a pattern
str_which(html_text(p[[1]]), "your_string_here")
The link below is very useful!
http://dept.stat.lsa.umich.edu/~jerrick/courses/stat701/notes/webscrape.html
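And if the \\n and \\t whitespace does show up once you extract the tables from those pages, here is a rough sketch that pulls every table out of the p list above and squishes whitespace in character columns (purrr and stringr come with the tidyverse already loaded; fill = TRUE mirrors the original question):
library(rvest)
library(purrr)
library(stringr)

# One data frame per table, across all pages
tbls <- p %>%
  map(html_table, fill = TRUE) %>%
  flatten()

# Collapse runs of whitespace (including \n and \t) in character columns
tbls_clean <- map(tbls, function(df) {
  df[] <- lapply(df, function(col) {
    if (is.character(col)) str_squish(col) else col
  })
  df
})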