I would like to find all "a href" links on a particular webpage. There are examples of how to do this with XML, rvest, or splashr, but I would like to do it with RSelenium, and not by finding the elements first and then calling getElementAttribute(..., "href").
I'm looking for something similar to read_html, html_nodes, and html_attr from rvest, or render_html from splashr, which later works with rvest.
EDIT: Ideally something similar to render_html, which lets all scripts finish.
Start of code:
library(RSelenium)
rd <- rsDriver(verbose = F)
remotedriver <- rd$client
remotedriver$navigate('https://stackoverflow.com/')
Now I get stuck on how to find all URLs that point somewhere. I have tried
library(XML)
html_parse <- htmlParse(remotedriver$getPageSource()[[1]])
But I don't know which functions can be used to work with the html_parse object.
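For what it's worth, here is a minimal sketch of two ways to pull the hrefs out of the rendered source (assuming the RSelenium session above is still open; add a Sys.sleep() before grabbing the source if you need scripts to finish first):
# Option 1: XPath directly on the document returned by htmlParse()
hrefs <- XML::xpathSApply(html_parse, "//a/@href")
# Option 2: feed the rendered source straight into the usual rvest pipeline
library(rvest)
hrefs <- remotedriver$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes("a") %>%
  html_attr("href")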
Related
I am trying to scrape this link here: https://sportsbook.draftkings.com/leagues/hockey/88670853?category=player-props&subcategory=goalscorer -- and return the player props on the page in some sort of workable table within R where I can clean it to a final result.
I am working with the RSelenium package in combination with the tidyverse and rvest in order to scrape this info into R. I have had success on other pages on this website in the past, but can't seem to crack this one.
I've gotten as far as inspecting the webpage down to the most granular <div> that contains the entire list of players on the page, and copied the corresponding XPath from that line of the inspector.
My code looks as such:
# Run this code to scrape the player props for goals from draftkings
library(tidyverse)
library(RSelenium)
library(rvest)
# start up local selenium server
rD <- rsDriver(browser = "chrome", port=6511L, chromever = "96.0.4664.45")
remote_driver <- rD$client
# Open chrome
remote_driver$open()
# Navigate to URL
url <- "https://sportsbook.draftkings.com/leagues/hockey/88670853?category=player-props&subcategory=goalscorer"
remote_driver$navigate(url)
# Find the table via the XML path
table_xml <- remote_driver$findElement(using = "xpath", value = "//*[@id='root']/section/section[2]/section/div[3]/div/div[3]/div/div/div[2]/div")
# Intended: locate the table, turn it into a list, and bind it into a single dataframe
player_prop_table <- table_xml$getElementAttribute("innerHTML")
That last line, instead of returning a workable list, tibble, or dataframe like I'm used to, returns a large list that contains the same values I see in the Chrome inspect tool.
What am I missing here in terms of successfully scraping this page?
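One way forward, sketched under the assumption that the XPath above still matches the page: getElementAttribute("innerHTML") hands back a raw HTML string inside a list, so parse that string with rvest and extract the pieces yourself (the "div" selector below is a placeholder; substitute whatever node wraps each player prop in the inspector):
library(rvest)
# getElementAttribute() returns a list; the first element is the HTML string
raw_html <- table_xml$getElementAttribute("innerHTML")[[1]]
props <- read_html(raw_html) %>%
  html_nodes("div") %>%   # placeholder selector: verify against the live page
  html_text()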
I'm scraping a number of webpages, and I noticed that rvest (read_html, then html_text) and RSelenium (getPageSource()) provide different results.
More specifically, when dropdown menus are involved, html_text only gives you the names of the choices, while with RSelenium you can also get the URL of the page that you will be directed to once you choose one.
My question here would be: (1) why the difference, and what exactly is the nature of the difference? and (2) is there a way to get the same source text extraction as the RSelenium one, but using a faster method such as the rvest package?
I have tried using webdriver, a PhantomJS implementation, per the suggestion from rvest vs RSelenium results for text extracting, and its getSource function does provide the same results as RSelenium. However, while this is faster than RSelenium, it is still much slower than rvest.
library(rvest)
library(RSelenium)
library(webdriver)
library(tictoc)
library(robotstxt)
test_url <- "https://www.bea.gov"
robotstxt::paths_allowed(test_url)
# rvest
tictoc::tic()
resultA <- html_text(read_html(test_url))
tictoc::toc()
# RSelenium
tictoc::tic()
remDr <- remoteDriver(port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(test_url)
resultB <- remDr$getPageSource()[[1]]
tictoc::toc()
# webdriver
tictoc::tic()
pjs <- run_phantomjs()
ses <- Session$new(port = pjs$port)
ses$go(test_url)
resultC <- ses$getSource()
tictoc::toc()
You can see that resultA is different from resultB and resultC. More specifically, my focus is the part from the word "Tools" onwards, where the dropdown menu for choosing among this website's different "Tools" tabs appears.
Showing just a small chunk, choosing "BEARFACTS" in rvest gives:
BEARFACTS\n \n \n
while in RSelenium it is something like the following:
<li class=\"expanded dropdown\">\n BEARFACTS\n
The difference between RSelenium and rvest is:
RSelenium runs a real web browser, so it will load any JavaScript contained in the webpage (JavaScript is often used to load additional HTML elements or data after the initial HTML has loaded).
rvest does not run JavaScript, and therefore retrieves the page HTML faster, but it will miss any elements loaded with JavaScript after the initial page load.
Some useful tips:
When scraping a page that doesn't load javascript, use rvest.
When you must use RSelenium, try using a headless option to improve speed (it will load the page in a browser just like normal, but it won't display any of the graphical elements, so it will be faster).
Example of using RSelenium headless
eCaps <- list(chromeOptions = list(
args = c('--headless', '--disable-gpu', '--window-size=1280,800')
))
rD <- rsDriver(browser=c("chrome"), verbose = TRUE, chromever="78.0.3904.105", port=4447L, extraCapabilities = eCaps)
I tried using rvest to extract the links behind "VAI ALLA SCHEDA PRODOTTO" from this website:
https://www.asusworld.it/series.asp?m=Notebook#db_p=2
My R code:
library(rvest)
page.source <- read_html("https://www.asusworld.it/series.asp?m=Notebook#db_p=2")
version.block <- html_nodes(page.source, "a") %>% html_attr("href")
However, I can't get any links that look like "/model.asp?p=2340487". How can I do this?
You may utilize RSelenium to request the intended information from the website.
Load the relevant packages. (Please ensure that the R package 'wdman' is up-to-date.)
library("RSelenium")
library("wdman")
Initialize the RSelenium server (I use Firefox, which I recommend).
rD <- rsDriver(browser = "firefox", port = 4850L)
rd <- rD$client
Navigate to the URL (and set an appropriate waiting time).
rd$navigate("https://www.asusworld.it/series.asp?m=Notebook#db_p=2")
Sys.sleep(5)
Request the intended information (you may refer to, for example, the 'xpath' of the element).
element <- rd$findElement(using = 'xpath', "//*[@id='series']/div[2]/div[2]/div/div/div[2]/table/tbody/tr/td/div/a/div[2]")
Display the requested element (i.e., information).
element$getElementText()
[[1]]
[1] "VAI ALLA SCHEDA PRODOTTO"
A detailed tutorial is provided here (for OS, see this tutorial). Hopefully, this helps.
I will change the website to make this question better. I am still facing similar issues: I can't use only the rvest package, and maybe the answer will be easier to obtain with RSelenium. Website: http://ravimaailma.fi/cg/tulokset/20/ and I want to obtain links from the main article which would direct me to individual race results. Links look something like this: http://ravimaailma.fi/article/tulokset/pori-18-11-2017-tulokset/8718/
I'm trying to use simple rvest, as I thought that would be all that's needed here. SelectorGadget gives the links' CSS as .article-title a, so my code is simply
url %>%
read_html() %>%
html_nodes(".article-title a") %>%
html_text()
This returns nothing. The website loads more results when you scroll down, but I thought I would at least get the first results out. The code below gives out some links, and links 28:32 look promising, but I think they are links from the sidebar, not from the article.
url %>%
read_html() %>%
html_nodes("a") %>%
html_attr("href")
What am I doing wrong here, and can RSelenium help me?
Here is my partial answer, still not getting everything, but maybe it helps someone. The code will return one link for the first result. I'm not sure why it isn't giving them all. I'm using
library(RSelenium)
rD <- rsDriver(port = 4444L, browser = "chrome")
remDr <- rD[["client"]]
remDr$navigate("http://ravimaailma.fi/cg/tulokset/20/")
elem <- remDr$findElement(using="css selector", value=".article-title a")
elemtxt <- elem$getElementAttribute("href")
#Click button to load more results
#button <- remDr$findElement(using="id", value="loadmore")
#button$clickElement()
remDr$close()
I haven't used the button click yet, but it seemed to be working as well. The only problem is that I can't get all the results from the site.
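A likely fix, sketched but untested against the live site: findElement() only ever returns the first match, whereas findElements() (plural) returns all of them, so loop over that list for the hrefs:
elems <- remDr$findElements(using = "css selector", value = ".article-title a")
links <- sapply(elems, function(el) el$getElementAttribute("href")[[1]])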
[I'm not (yet) allowed to write comments, so I chose to make this post an answer]
RSelenium is not always necessary; you can also interact with a website directly using PhantomJS (see e.g. this example).
If you provide an example from the website instead of a local link to a .pdf, I can try to find out how to retrieve the data.
I would like to seek help in scraping information from HardwareZone.
This is the link: http://www.hardwarezone.com.sg/search/forum/camera
I would like to get all the information on cameras on the forum.
library(RSelenium)
library(magrittr)
base_url = "http://www.hardwarezone.com.sg/search/forum/camera"
checkForServer()
startServer()
remDrv <- remoteDriver()
remDrv$open()
I tried using the above code for the first part, and I got an error on the last line. (I'm a Mac user.)
The error I got is an undefined RCurl call; I have looked at many possible solutions, but I still cannot solve it.
library(rvest)
url <- "http://www.hardwarezone.com.sg/search/forum/camera"
result <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[@id="cse"]/table[1]') %>%
  html_table()
result
I tried using another method (the code above), but it still couldn't work.
Can anyone guide me through this?
Thank you.
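For what it's worth: checkForServer() and startServer() were removed from recent versions of RSelenium (rsDriver() replaces them), and the search results on that page appear to be injected by JavaScript, so plain rvest never sees the table. A sketch, untested against the live site, that renders the page with RSelenium first and then parses it with rvest (the port and sleep time are arbitrary choices):
library(RSelenium)
library(rvest)
rD <- rsDriver(browser = "firefox", port = 4567L)
remDr <- rD$client
remDr$navigate("http://www.hardwarezone.com.sg/search/forum/camera")
Sys.sleep(5)  # give the JavaScript-rendered results time to appear
read_html(remDr$getPageSource()[[1]]) %>%
  html_nodes(xpath = '//*[@id="cse"]//table') %>%
  html_table()
remDr$close()
rD$server$stop()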