Using R to scrape play-by-play data

I am currently trying to scrape the play-by-play entries from the following link:
https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4
I used SelectorGadget to determine CSS selectors and ended up with '//td'. However, when I attempt to scrape the data using this, html_nodes() returns an empty list, and the following code therefore returns an error.
library("rvest")
url <- "https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4"
play_by_play <- url %>%
read_html %>%
html_node(xpath='//td') %>%
html_table()
play_by_play
Does anybody know how to resolve this issue?
Thank you in advance!

I think you cannot get the table simply because there is no table in the website (see the source).
If there were any tables, you could get them with the following code.
library("rvest")
url <- "https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4"
play_by_play <- url %>%
read_html %>%
html_table()
play_by_play
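You can also confirm this from R before trying to parse anything; a quick check of the raw HTML (a sketch):
library(rvest)
page <- read_html("https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4")
# both counts are 0 here, because the table markup is injected by JavaScript after load
length(html_nodes(page, "table"))
length(html_nodes(page, "td"))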

The data in the page you are loading is rendered with JavaScript, so when you use read_html() you are not seeing what you want. If you view the page source, you will not see any table or td elements.
What you can do is use another option such as RSelenium to get the rendered page source; if you want to use rvest afterwards, you can scrape from the source you get.
library(rvest)
library(RSelenium)
url <- "https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4"

# start a browser session and load the page so the JavaScript can run
rD <- rsDriver()
remDr <- rD$client
remDr$navigate(url)

# grab the rendered page source and hand it to rvest
page_source <- remDr$getPageSource()[[1]]
play_by_play <- read_html(page_source, encoding = "UTF-8") %>%
  html_nodes("td")

remDr$close()
rD$server$stop()
rm(remDr, rD)
gc()
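Another option, instead of driving a full browser: data rendered by JavaScript is usually fetched from a JSON endpoint, which you can find in the browser's developer tools (Network tab) and call directly. A sketch with httr and jsonlite; the endpoint URL below is hypothetical and stands in for whatever request the Network tab actually shows:
library(httr)
library(jsonlite)
# hypothetical endpoint; replace with the real request from the Network tab
api_url <- "https://www.basket.fi/api/game/4677793/play-by-play"
resp <- GET(api_url)
stop_for_status(resp)
play_by_play <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))
str(play_by_play)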

Related

Issue web scraping a website: not extracting anything

I am trying to extract data from the following website: 'https://2010-2014.kormany.hu/hu/hirek'. When I try to extract, for example, the articles' links from that website using the following, I get nothing.
library(rvest)
library(dplyr)
library(XML)
url <- 'https://2015-2019.kormany.hu/hu/hirek'
links <- read_html(url) %>%
  html_nodes("div") %>%
  html_nodes(xpath = '//*[@class="article"]') %>%
  html_nodes("h2") %>%
  html_nodes("a") %>%
  html_attr("href")
links
> character(0)
I don't even get anything if I run the following code:
links <- read_html(url) %>% html_nodes("div")
links
> character(0)
This is very strange since, when I inspect the website, it seems that I should be getting the list of URLs from the code I provided. According to the website's source, there are "div" nodes ('view-source:https://2015-2019.kormany.hu/hu/hirek'). Does anyone know what I could be doing wrong?
Today I re-tried my code and it works perfectly. I am not sure what was happening yesterday.
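One way to diagnose this kind of intermittent failure is to look at the HTTP response itself rather than the parsed nodes, since an error page or rate-limit response still parses, just without the expected nodes. A diagnostic sketch with httr:
library(httr)
resp <- GET("https://2015-2019.kormany.hu/hu/hirek")
# 200 means the page arrived; a 4xx/5xx status would explain an empty node set
status_code(resp)
# peek at the start of the body to see what was actually returned
substr(content(resp, as = "text", encoding = "UTF-8"), 1, 200)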

How to read specific tags using xml2

Problem
I am trying to get all of the URLs in https://www.ato.gov.au/sitemap.xml (N.B. it is a ~9 MB file) using xml2. Any pointers appreciated.
My attempt
library("xml2")
data1 <- read_xml("https://www.ato.gov.au/sitemap.xml")
xml_find_all(data, ".//loc")
I'm not getting the output I need:
{xml_nodeset (0)}
Not using xml2, but I was able to get it using rvest:
library(dplyr)
library(rvest)
url <- "https://www.ato.gov.au/sitemap.xml"
url %>%
  read_html() %>%
  html_nodes("loc") %>%
  html_text()
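For what it's worth, the original xml2 attempt returns an empty nodeset because of XML namespaces: sitemap files declare a default namespace, and an unprefixed XPath such as .//loc matches nothing inside it. rvest's read_html() works because the HTML parser ignores namespaces. Staying in xml2, a sketch (xml2 assigns the default namespace the prefix d1):
library(xml2)
data1 <- read_xml("https://www.ato.gov.au/sitemap.xml")
xml_ns(data1)  # shows d1 bound to the sitemap namespace
locs <- xml_find_all(data1, ".//d1:loc")
# or, ignoring namespaces entirely:
# locs <- xml_find_all(data1, "//*[local-name()='loc']")
xml_text(locs)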
In case you need all the URLs in a data frame, you can use the code below:
library(XML)
library(RCurl)
library(dplyr)
url <- "https://www.ato.gov.au/sitemap.xml"
xData <- getURL(url)
doc <- xmlParse(xData)
data <- xmlToList(doc)
a <- as.data.frame(unlist(data))
a <- dplyr::filter(a, grepl("http", `unlist(data)`))
head(a)
The code above will give you a data frame listing all the URLs. You can also use the "Xenu" URL fetcher to extract URLs from a website that are not included in its sitemap.
Let me know if you get stuck somewhere.

Web scraping data for use in RStudio

I want to pull the data from this server site into RStudio. I am new to R, so I am not at all sure what is possible. Any help with code to achieve this would be appreciated.
http://hbrcdata.hbrc.govt.nz/hydrotel/cgi-bin/hydwebserver.cgi/points/details?point=679&samples=true
install.packages("rvest")
library('rvest')
install.packages('XML')
library('XML')
library("httr")
#Specifying the url for desired website to be scrapped
url <- 'http://hbrcdata.hbrc.govt.nz/hydrotel/cgi-
bin/hydwebserver.cgi/points/samples?point=679'
webpage <- read_html(url)
tbls <- html_nodes(webpage, "table")
head(tbls)
tbls_ls <- webpage %>%
html_nodes("table") %>%
html_table(fill = TRUE)
tbl <- as.data.frame(tbls_ls)
View(tbl)
I have tried to fetch a few other tables from the given website, which works fine.
For example, rainfall depth:
http://hbrcdata.hbrc.govt.nz/hydrotel/cgi-bin/hydwebserver.cgi/points/details?point=63
A small modification in the URL, as follows, will fetch the actual table; the rest of the code remains the same (change details?point=63 to samples?point=63):
url <- 'http://hbrcdata.hbrc.govt.nz/hydrotel/cgi-bin/HydWebServer.cgi/points/samples?point=63'
For more help you can refer to:
http://bradleyboehmke.github.io/2015/12/scraping-html-tables.html
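Putting that together with the question's code, a sketch of the full pipeline (assuming the samples page serves a plain HTML table, as described above):
library(rvest)
url <- 'http://hbrcdata.hbrc.govt.nz/hydrotel/cgi-bin/HydWebServer.cgi/points/samples?point=63'
tbls_ls <- read_html(url) %>%
  html_nodes("table") %>%
  html_table(fill = TRUE)
tbl <- tbls_ls[[1]]  # first table on the page
head(tbl)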

RSelenium xpath: not able to save response

I'm trying to get the stock levels from https://www.vinmonopolet.no/,
for example for this wine: https://www.vinmonopolet.no/vmp/Land/Chile/Gato-Negro-Cabernet-Sauvignon-2017/p/295301
Using RSelenium:
library(RSelenium)
rD <- rsDriver()
remDr <- rD[["client"]]
remDr$navigate("https://www.vinmonopolet.no/vmp/Land/Chile/Gato-Negro-Cabernet-Sauvignon-2017/p/295301")
webElement <- remDr$findElement('xpath', '//*[@id="product_2953010"]/span[2]')
webElement$clickElement()
This renders a response, but how do I store it?
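If you want to stay in RSelenium, the rendered text of the element can be stored directly instead of clicking it; a sketch continuing the question's session, using RSelenium's getElementText():
# read the element's text rather than clicking it
webElement <- remDr$findElement('xpath', '//*[@id="product_2953010"]/span[2]')
stock_text <- webElement$getElementText()[[1]]
stock_text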
Maybe rvest is what you are looking for?
library(rvest)
library(tidyverse)
url <- "https://www.vinmonopolet.no/vmp/Land/Chile/Gato-Negro-Cabernet-Sauvignon-2017/p/295301"
page <- read_html(url)
stock <- page %>%
  html_nodes(".product-stock-status div") %>%
  html_text()
stock.df <- data.frame(url, stock)
To extract the number, use:
stock.df <- stock.df %>%
  mutate(stock = as.numeric(gsub(".*?([0-9]+).*", "\\1", stock)))
I got it to work by just sending the right plain request; no need for R:
https://www.vinmonopolet.no/vmp/store-pickup/1101/pointOfServices?locationQuery=0661&cartPage=false&entryNumber=0&CSRFToken=718228c1-1dc1-41cd-a35e-23197bed7b0c
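If you would rather issue that same request from R, httr can reproduce it; a sketch (whether the CSRFToken is actually required, and the exact shape of the response, are assumptions to check against the real reply):
library(httr)
resp <- GET("https://www.vinmonopolet.no/vmp/store-pickup/1101/pointOfServices",
            query = list(locationQuery = "0661", cartPage = "false", entryNumber = 0))
stop_for_status(resp)
content(resp, as = "text", encoding = "UTF-8")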

rvest + selector gadget return empty list

I'm attempting to scrape political endorsement data from Wikipedia tables (a pretty generic scraping task), and the regular process of using rvest on the CSS path identified by SelectorGadget is failing.
The wiki page is the endorsements page used in the code below, and the CSS path .jquery-tablesorter:nth-child(11) td seems to select the right part of the page.
Armed with the CSS, I would normally just use rvest to directly access these data, as follows:
"https://en.wikipedia.org/wiki/Endorsements_for_the_Republican_Party_presidential_primaries,_2012" %>%
html %>%
html_nodes(".jquery-tablesorter:nth-child(11) td")
but this returns:
list()
attr(,"class")
[1] "XMLNodeSet"
Do you have any ideas?
This might help. The .jquery-tablesorter class is added by JavaScript after the page loads, so it never appears in the raw HTML that rvest downloads; selecting the table by its server-side wikitable class works instead:
library(rvest)
URL <- "https://en.wikipedia.org/wiki/Endorsements_for_the_Republican_Party_presidential_primaries,_2012"
tab <- URL %>%
  read_html() %>%
  html_node("table.wikitable:nth-child(11)") %>%
  html_table()
This code stores the table that you requested as a data frame in the variable tab.
> View(tab)
I find that if I use the XPath suggestion from Chrome it works, because it addresses the table by its position in the raw HTML rather than by a class added later by JavaScript.
Chrome suggests an XPath of //*[@id="mw-content-text"]/table[4].
I can then run the following:
library(rvest)
URL <- "https://en.wikipedia.org/wiki/Endorsements_for_the_Republican_Party_presidential_primaries,_2012"
tab <- URL %>%
  read_html() %>%
  html_node(xpath = '//*[@id="mw-content-text"]/table[4]') %>%
  html_table()
