RSelenium XPath not able to save response - r

I'm trying to get the stock levels from https://www.vinmonopolet.no/,
for example for this wine: https://www.vinmonopolet.no/vmp/Land/Chile/Gato-Negro-Cabernet-Sauvignon-2017/p/295301
Using RSelenium:
library('RSelenium')
rD <- rsDriver()
remDr <- rD[["client"]]
remDr$navigate("https://www.vinmonopolet.no/vmp/Land/Chile/Gato-Negro-Cabernet-Sauvignon-2017/p/295301")
webElement <- remDr$findElement('xpath', '//*[@id="product_2953010"]/span[2]')
webElement$clickElement()
Clicking the element renders a response in the browser, but how do I store it? Ideally I want the full XML.

Maybe rvest is what you are looking for?
library(rvest)
library(tidyverse)
url <- "https://www.vinmonopolet.no/vmp/Land/Chile/Gato-Negro-Cabernet-Sauvignon-2017/p/295301"
page <- read_html(url)
stock <- page %>%
html_nodes(".product-stock-status div") %>%
html_text()
stock.df <- data.frame(url,stock)
To extract the number, use:
stock.df <- stock.df %>%
mutate(stock=as.numeric(gsub(".*?([0-9]+).*", "\\1", stock)))
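If you need the stock for several products, a small sketch (assuming purrr, which ships with the tidyverse) loops the same extraction over a vector of product URLs:
library(rvest)
library(purrr)
library(dplyr)
# Replace with the product pages you actually care about
urls <- c("https://www.vinmonopolet.no/vmp/Land/Chile/Gato-Negro-Cabernet-Sauvignon-2017/p/295301")
stock_df <- map_dfr(urls, function(u) {
  stock <- read_html(u) %>%
    html_nodes(".product-stock-status div") %>%
    html_text()
  tibble(url = u, stock = stock)
})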

Got it to work by just sending the right plain HTTP request, no need for R:
https://www.vinmonopolet.no/vmp/store-pickup/1101/pointOfServices?locationQuery=0661&cartPage=false&entryNumber=0&CSRFToken=718228c1-1dc1-41cd-a35e-23197bed7b0c
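For completeness, a minimal sketch of replaying that request from R with httr; note that the CSRFToken in the URL is session-specific, so it will not work verbatim and you would need a fresh token (and possibly session cookies) from your own browser session:
library(httr)
# The store-pickup endpoint discovered above; the CSRFToken is session-specific
url <- "https://www.vinmonopolet.no/vmp/store-pickup/1101/pointOfServices?locationQuery=0661&cartPage=false&entryNumber=0&CSRFToken=718228c1-1dc1-41cd-a35e-23197bed7b0c"
resp <- GET(url)
stop_for_status(resp)
content(resp, as = "text") # inspect the raw response (HTML fragment or JSON)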

Related

RCurl: is it possible to wait several seconds before scraping JavaScript content

I am using RCurl to scrape sentiment data, but I need it to wait several seconds before scraping. This is my initial code:
library(stringr)
library(curl)
links <- "https://www.dailyfx.com/sentiment"
con <- curl(links)
open(con)
html_string <- readLines(con, n = 3000)
html_string[1580:1700] #The data value property is "--" in this case
How do I add the waiting seconds properly?
Special thanks to @MrFlick for pointing out the situation:
curl will only pull the source code for that web page. The data that
is shown on that page is loaded via javascript after the page loads;
it is not contained in the page source. If you want to interact with a
page that uses javascript, you'll need to use something like RSelenium
instead. Or you'll need to reverse engineer the javascript to see
where the data is coming from and then perhaps make a curl request to
the data endpoint directly rather than the HTML page
With that said, I used RSelenium to accomplish this in the desired way:
library(RSelenium)
library(rvest)
library(tidyverse)
library(stringr)
rD <- rsDriver(browser="chrome", verbose=F, chromever = "103.0.5060.134")
remDr <- rD[["client"]]
remDr$navigate("https://www.dailyfx.com/sentiment")
Sys.sleep(10) # give the page time to load fully
html <- remDr$getPageSource()[[1]]
html_obj <- read_html(html)
#Take Buy and Sell Sentiment of Specific Assets
buy_sentiment <- html_obj %>%
html_nodes(".dfx-technicalSentimentCard__netLongContainer") %>%
html_children()
buy_sentiment <- as.character(buy_sentiment[[15]])
buy_sentiment <- as.numeric(str_match(buy_sentiment, "[0-9]+"))
sell_sentiment <- html_obj %>%
html_nodes(".dfx-technicalSentimentCard__netShortContainer") %>%
html_children()
sell_sentiment <- as.character(sell_sentiment[[15]])
sell_sentiment <- as.numeric(str_match(sell_sentiment, "[0-9]+"))
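When finished, it is good practice to close the browser and stop the Selenium server so the Chrome process does not linger; a standard RSelenium clean-up using the objects created above:
remDr$close()
rD$server$stop()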

Using R to scrape play-by-play data

I am currently trying to scrape the play-by-play entries from the following link:
https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4
I used SelectorGadget to determine selectors and ended up with '//td'. However, when I attempt to scrape the data using this, html_nodes() returns an empty list and the following code therefore returns an error.
library("rvest")
url <- "https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4"
play_by_play <- url %>%
read_html %>%
html_node(xpath='//td') %>%
html_table()
play_by_play
Does anybody know how to resolve this issue?
Thank you in advance!
I think you cannot get the table simply because there is no table in the page source (see "view source").
If there were any tables, you could get them with the following code.
library("rvest")
url <- "https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4"
play_by_play <- url %>%
read_html %>%
html_table()
play_by_play
The data on the page you are loading is rendered with JavaScript, so when you use read_html you are not seeing what you want. If you check "view source", you will not see any table or td elements in the page source.
What you can do is use another option like RSelenium to get the rendered page source, and if you want to use rvest afterwards you can scrape from the source you get.
library(rvest)
library(RSelenium)
url <- "https://www.basket.fi/basketball-finland/competitions/game/?game_id=4677793&season_id=110531&league_id=4"
rD<- rsDriver()
remDr <- rD$client
remDr$navigate(url)
remDr$getPageSource()[[1]]
play_by_play <-read_html(unlist(remDr$getPageSource()),encoding="UTF-8") %>%
html_nodes("td")
remDr$close()
rm(remDr, rD)
gc()
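From there, play_by_play is a nodeset of td elements; a small follow-up (no extra packages needed) converts it to a character vector of cell contents:
pbp_text <- html_text(play_by_play) # one entry per td cell
head(pbp_text)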

Scrape webpage using R and Chrome

I am trying to pull the table from this website into R using the path from Chrome's inspector, but it does not work. Could you help me with that? Thanks.
library(rvest)
library(XML)
url <- "https://seekingalpha.com/symbol/MNHVF/profitability"
webpage <- read_html(url)
rank_data_html <- html_nodes(webpage, 'section#cresscap') # table.cresscap-table
rank_data <- html_table(rank_data_html)
rank_data1 <- rank_data[[1]]
The data comes from an additional XHR call made dynamically by the page. You can make a request to that endpoint and handle the JSON response with jsonlite. Extract the relevant list of lists and use dplyr's bind_rows to generate your output. You can rename the columns to match those on the page if you want.
library(jsonlite)
library(dplyr)
data <- jsonlite::read_json('https://seekingalpha.com/symbol/MNHVF/cresscap/fields_ratings?category_id=4&sa_pro=false')
df <- bind_rows(data$fields)
head(df)
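If you want to rename the columns as mentioned above, first inspect the field names actually returned by the endpoint, since the exact names depend on the JSON:
names(df) # see the column names coming back from the endpoint
# then e.g. df %>% rename(new_name = old_name) with the names you find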

How to read specific tags using XML2

Problem
I am trying to get all of the URLs in https://www.ato.gov.au/sitemap.xml (N.B. it's a ~9 MB file) using xml2. Any pointers appreciated.
My attempt
library("xml2")
data1 <- read_xml("https://www.ato.gov.au/sitemap.xml")
xml_find_all(data1, ".//loc")
I'm not getting the output I need:
{xml_nodeset (0)}
Not using xml2, but I was able to get it using rvest:
library(dplyr)
library(rvest)
url <- "https://www.ato.gov.au/sitemap.xml"
url %>%
read_html() %>%
html_nodes("loc") %>%
html_text()
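If you would rather stay with xml2, the empty nodeset in the question is most likely due to the sitemap's default XML namespace; a hedged sketch (the d1 prefix is whatever xml_ns() reports for the default namespace, so check it first):
library(xml2)
data1 <- read_xml("https://www.ato.gov.au/sitemap.xml")
xml_ns(data1) # check the namespace prefix, usually d1 for the default namespace
urls <- xml_text(xml_find_all(data1, ".//d1:loc", xml_ns(data1)))
head(urls)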
Just in case you need all the URLs in a data frame, you can use the code below:
library(XML)
library(xml2)
library(httpuv)
library(httr)
library(RCurl)
library(data.table)
library(dplyr)
url <- "https://www.ato.gov.au/sitemap.xml"
xData <- getURL(url)
doc <- xmlParse(xData)
data<-xmlToList(doc)
a<-as.data.frame(unlist(data))
a<-dplyr::filter(a,grepl("http",`unlist(data)`) )
head(a)
The code above will give you a data frame with the list of all URLs. You can also use the "Xenu" URL fetcher software to extract URLs from the website that are not included in the sitemap.
Let me know if you get stuck somewhere in the middle.

How to extract id names from search result urls using rvest? (CSS selector isn't working)

I'm attempting to extract a list of product item names from a search result page (link here).
library(rvest)
results <- read_html('https://www.fishersci.com/us/en/catalog/search/products?keyword=sodium+hydroxide&nav=')
results %>%
html_nodes(".result_title a") %>%
html_text()
which returns
character(0)
I've also attempted to make use of:
html_attr('href')
with no luck. Can I even use CSS to pull the titles of these links? I'm trying to make a list of the 30 product results (e.g. "Sodium Hydroxide (Pellets/Certified ACS), Fisher Chemical"). Are the ids for these links generated by JavaScript?
Thanks for any help, this is my first scraping project and my knowledge of web design is much simpler than this particular page.
The results are indeed generated with JavaScript. rvest doesn't handle JavaScript at the moment, but other alternatives exist.
For example, you can use Selenium and PhantomJS to get what you want:
library(RSelenium) # Wrapper around Selenium
library(wdman) # helper to download and configure phantomjs
library(rvest)
phantomjs <- phantomjs(port = 4444L)
remote_driver <- remoteDriver(browserName = "phantomjs", port = 4444L)
remote_driver$open(silent = TRUE)
remote_driver$navigate("https://www.fishersci.com/us/en/catalog/search/products?keyword=sodium+hydroxide&nav=")
page_source <- remote_driver$getPageSource()[[1]]
page_source %>%
read_html() %>%
html_nodes(css = '.result_title') %>%
html_text()
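When done, a short clean-up (using the objects created above) closes the session and stops the PhantomJS process started by wdman:
remote_driver$close()
phantomjs$stop()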
