I tried using rvest to extract the "VAI ALLA SCHEDA PRODOTTO" links from this website:
https://www.asusworld.it/series.asp?m=Notebook#db_p=2
My R code:
library(rvest)
page.source <- read_html("https://www.asusworld.it/series.asp?m=Notebook#db_p=2")
version.block <- html_nodes(page.source, "a") %>% html_attr("href")
However, I can't get any links that look like "/model.asp?p=2340487". How can I do this?
The element looks like this:
You may utilize RSelenium to request the intended information from the website.
Load the relevant packages. (Please ensure that the R package 'wdman' is up-to-date.)
library("RSelenium")
library("wdman")
Initialize the R Selenium server (I use Firefox, which is recommended).
rD <- rsDriver(browser = "firefox", port = 4850L)
rd <- rD$client
Navigate to the URL (and set an appropriate waiting time).
rd$navigate("https://www.asusworld.it/series.asp?m=Notebook#db_p=2")
Sys.sleep(5)
Request the intended information (you may refer to, for example, the 'xpath' of the element).
element <- rd$findElement(using = 'xpath', "//*[@id='series']/div[2]/div[2]/div/div/div[2]/table/tbody/tr/td/div/a/div[2]")
Display the requested element (i.e., information).
element$getElementText()
[[1]]
[1] "VAI ALLA SCHEDA PRODOTTO"
A detailed tutorial is provided here (for OS, see this tutorial). Hopefully, this helps.
I am trying to scrape this link here: https://sportsbook.draftkings.com/leagues/hockey/88670853?category=player-props&subcategory=goalscorer -- and return the player props on the page in some sort of workable table within R where I can clean it to a final result.
I am working with the RSelenium package in combination with the tidyverse and rvest in order to scrape this info into R. I have had success on other pages on this website in the past, but can't seem to crack this one.
I've gotten as far as Inspecting the webpage down to the most granular <div> that contains the entire list of players on the page, and copied the corresponding xpath from that line of the inspection.
My code looks as such:
# Run this code to scrape the player props for goals from draftkings
library(tidyverse)
library(RSelenium)
library(rvest)
# start up local selenium server
rD <- rsDriver(browser = "chrome", port=6511L, chromever = "96.0.4664.45")
remote_driver <- rD$client
# Open chrome
remote_driver$open()
# Navigate to URL
url <- "https://sportsbook.draftkings.com/leagues/hockey/88670853?category=player-props&subcategory=goalscorer"
remote_driver$navigate(url)
# Find the table via the XML path
table_xml <- remote_driver$findElement(using = "xpath", value = "//*[@id='root']/section/section[2]/section/div[3]/div/div[3]/div/div/div[2]/div")
# Locates the table, turns it into a list, and binds into a single dataframe
player_prop_table <- table_xml$getElementAttribute("innerHTML")
That last line, instead of returning a workable list, tibble, or dataframe like I'm used to, returns a large list that contains the same values I see in the Chrome inspect tool.
What am I missing here in terms of successfully scraping this page?
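For what it's worth, getElementAttribute("innerHTML") returns a raw HTML string rather than a data structure, so one rough next step (a sketch only, assuming the odds sit in <tr> rows, which may not match the current page markup) is to hand that string back to rvest:
# Sketch: parse the innerHTML string and dump the row text
prop_html <- read_html(player_prop_table[[1]])
prop_rows <- prop_html %>% html_nodes("tr") %>% html_text()
head(prop_rows)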
I'm trying to get the product link from a customer's profile page using R's rvest package.
I've referenced various questions on Stack Overflow, including here (could not read webpage with read_html using rvest package from r), but each time I try something, I'm not able to return the correct result.
For example on this profile page:
https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8
I'd like to be able to return the link below, with the end goal of extracting the product id: B01A51S9Y2
https://www.amazon.com/Amagabeli-Stainless-Chainmail-Scrubber-Pre-Seasoned/dp/B01A51S9Y2?ref=pf_vv_at_pdctrvw_dp
library(dplyr)
library(rvest)
library(stringr)
library(httr)
# get url
url='https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_dp_d_gw_tr?ie=UTF8'
x <- GET(url, add_headers('user-agent' = 'test'))
page <- read_html(x)
page %>%
html_nodes("[class='a-link-normal profile-at-product-box-link a-text-normal']") %>%
html_text()
#I did a test to see if i could even find the href, with no luck
test <- page %>%
html_nodes("#a-page") %>%
html_text()
grepl("B01A51S9Y2",test)
Thanks for the tip @Qharr on RSelenium. That is helpful, but I'm still unsure how to extract the link or ASIN.
library(RSelenium)
driver <- rsDriver(browser=c("chrome"), port = 4574L, chromever = "77.0.3865.40")
rd <- driver[["client"]]
rd$open()
rd$navigate("https://www.amazon.com/gp/profile/amzn1.account.AETT6GZORFV55BFNOAVFDIJ75QYQ/ref=cm_cr_arp_d_gw_btm?ie=UTF8")
prod <- rd$findElement(using = "css", '.profile-at-product-box-link')
prod$getElementText
This doesn't really return anything
Adding getElementAttribute('href') allowed me to get the link:
prod <- rd$findElements(using = "css selector", '.profile-at-product-box-link')
for (link in 1:length(prod)){
print(prod[[link]]$getElementAttribute('href'))
}
That info is pulled in dynamically via a POST request the page makes, which your initial rvest request doesn't capture. This subsequent request returns, in JSON format, the content governing the ASINs, the product links, etc.
You can find it in the network tab of dev tools (F12). Press F5 to refresh the page, then examine the network traffic:
It is not a simple POST request to mimic, so I would just go with RSelenium to let the page render and then use the css selector
.profile-at-product-box-link
to gather a collection of webElements you can loop over and extract the href attribute from.
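As a rough sketch of that last step (and of the original goal of pulling out the ASIN), assuming the rd client from the RSelenium snippet above and the "/dp/<10-character id>" href format shown in the question:
# Sketch: pull each product href and extract the 10-character ASIN
prods <- rd$findElements(using = "css selector", ".profile-at-product-box-link")
hrefs <- sapply(prods, function(x) x$getElementAttribute("href")[[1]])
asins <- sub(".*/dp/([A-Z0-9]{10}).*", "\\1", hrefs)
asins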
I have tried to scrape the content of a news website ('titles', 'content', etc) but the nodes I am using do not return the content.
I have tried different nodes/tags, but none of them seem to be working. I have also used the SelectorGadget without any result. I have used the same strategy for scraping other websites and it has worked with no issues.
Here is an example of trying to get the 'content'
library(rvest)
url_test <- read_html('https://lasillavacia.com/silla-llena/red-de-la-paz/historia/las-disidencias-son-fruto-de-anos-de-division-interna-de-las-farc')
content_test <- html_text(html_nodes(url_test, ".article-body-mt-5"))
I have also tried using the xpath instead of the css class with no results.
Here is an example of trying to get the 'date'
content_test <- html_text(html_nodes(url_test, ".article-date"))
Even if I try to scrape all the <h> tags from the page, for example, I also get character(0).
What can be the problem? Thanks for any help!
Since the content is loaded into the page by javascript, I used RSelenium to scrape the data and it worked:
library(RSelenium)
#Setting the remote browser
remDr <- RSelenium::remoteDriver(remoteServerAddr = "192.168.99.100",
port = 4444L,
browserName = "chrome")
remDr$open()
url_test <- 'https://lasillavacia.com/silla-llena/red-de-la-paz/historia/las-disidencias-son-fruto-de-anos-de-division-interna-de-las-farc'
remDr$navigate(url_test)
#Checking if the website page is loaded
remDr$screenshot(display = TRUE)
#Getting the content
content_test <- remDr$findElements(using = "css selector", value = '.article-date')
content_test <- sapply(content_test, function(x){x$getElementText()})
> content_test
[[1]]
[1] "22 de Septiembre de 2018"
Two things.
Your css selector is wrong. It should have been:
".article-body.mt-5"
The data is dynamically loaded and returned as JSON. You can find the endpoint in the network tab. No need for the overhead of using Selenium.
library(jsonlite)
data <- jsonlite::read_json('https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json')
The body is HTML, so you could use an HTML parser. The following is a simple text dump; you would refine it with node selection (see the sketch after the code).
library(rvest)
read_html(data[[1]]$body) %>% html_text()
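For example, a rough refinement (assuming the article text sits in <p> tags inside that body field) could be:
# Sketch: keep only the paragraph text instead of the full dump
read_html(data[[1]]$body) %>% html_nodes("p") %>% html_text()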
I'm scraping a number of webpages, and I've noticed the different results that rvest (read_html, then html_text) provides versus what RSelenium (getPageSource()) provides.
More specifically, when dropdown menus are involved, using html_text only gives you the names of the choices, while using RSelenium you can get the url of the page that you will be directed to once you choose one.
My questions here would be: (1) why the difference, and what exactly is the nature of the difference? and (2) is there a way to get the same source text extraction as the RSelenium one, but using a faster approach such as the rvest package?
I have tried using webdriver, a PhantomJS implementation, per the suggestion from rvest vs RSelenium results for text extracting, and its getSource function does provide the same results as RSelenium. However, while this is faster than RSelenium, it is still much slower than rvest.
library(rvest)
library(RSelenium)
library(webdriver)
library(tictoc)
library(robotstxt)
test_url <- "https://www.bea.gov"
robotstxt::paths_allowed(test_url)
# rvest
tictoc::tic()
resultA <- html_text(read_html(test_url))
tictoc::toc()
# RSelenium
tictoc::tic()
remDr <- remoteDriver(port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate(test_url)
resultB <- remDr$getPageSource(test_url)
tictoc::toc()
# webdriver
tictoc::tic()
pjs <- run_phantomjs()
ses <- Session$new(port = pjs$port)
ses$go(test_url)
resultC <- ses$getSource()
tictoc::toc()
You can see that resultA is different from resultB and resultC. More specifically, my focus is on the part from the word "Tools" onwards, which is where the dropdown menu for the different "Tools" tabs that this website provides appears.
Showing just a small chunk, the "BEARFACTS" entry in rvest looks like:
BEARFACTS\n \n \n
while in RSelenium it is something like the following:
<li class=\"expanded dropdown\">\n BEARFACTS\n
The difference between RSelenium and rvest is:
RSelenium runs a real web browser, so it will load any javascript contained in the webpage (javascript is often used to load additional html elements or data after the initial html has loaded).
rvest does not run javascript, and therefore retrieves the page html faster, but will miss any elements loaded with javascript after the initial page load.
Some useful tips:
When scraping a page that doesn't load javascript, use rvest.
When you must use RSelenium, try using a headless option to improve speed (it will load the page in a browser just like normal, but it won't display any of the graphical elements, so it will be faster).
Example of using RSelenium headless
eCaps <- list(chromeOptions = list(
args = c('--headless', '--disable-gpu', '--window-size=1280,800')
))
rD <- rsDriver(browser=c("chrome"), verbose = TRUE, chromever="78.0.3904.105", port=4447L, extraCapabilities = eCaps)
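Once the headless driver is up, you can still hand the rendered source to rvest for the actual extraction; a minimal sketch using the test URL from the question:
# Sketch: render with headless Chrome, then parse the source with rvest
library(rvest)
remDr <- rD$client
remDr$navigate("https://www.bea.gov")
rendered <- remDr$getPageSource()[[1]]
html_text(read_html(rendered))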
This question is based on another that I saw closed, which sparked my curiosity as I learned something new about using Google Chrome's Inspect Element to create the HTML parsing path for XML::getNodeSet. While that question was closed (I think it may have been too broad), I'll ask a smaller, more focused question that may get at the root of the problem.
I tried to help the poster by writing code I typically use for scraping, but ran into a wall immediately, as the poster wanted elements from Google Chrome's Inspect Element. That is not the same as the HTML from htmlTreeParse, as demonstrated here:
library(XML)
url <- "http://collegecost.ed.gov/scorecard/UniversityProfile.aspx?org=s&id=198969"
doc <- htmlTreeParse(url, useInternalNodes = TRUE)
m <- capture.output(doc)
any(grepl("258.12", m))
## FALSE
But here in Google Chrome's Inspect Element we can see that this information is provided (in yellow):
How can we get the information from Google Chrome's Inspect Element into R? The poster could obviously copy and paste the code into a text editor and parse it that way, but they are looking to scrape, and that workflow does not scale. Once the poster can get this info into R, they can use typical HTML parsing techniques (XML and RCurl-fu).
You should be able to scrape the page using something like the following code for RSelenium. You need to have java installed and available on your path for the startServer() line to work (and thus for you to be able to do anything).
library("RSelenium")
checkForServer()
startServer()
remDr <- remoteDriver(remoteServerAddr = "localhost",
port = 4444,
browserName = "firefox"
)
url <- "http://collegecost.ed.gov/scorecard/UniversityProfile.aspx?org=s&id=198969"
remDr$open()
remDr$navigate(url)
source <- remDr$getPageSource()[[1]]
Check to make sure it worked according to your test:
> grepl("258.12", source)
[1] TRUE
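From there the retrieved source can go through the usual XML parsing workflow, e.g. (a rough sketch; the "//td" xpath is only illustrative):
# Sketch: parse the rendered source and query it with getNodeSet
library(XML)
doc <- htmlParse(source, asText = TRUE)
nodes <- getNodeSet(doc, "//td")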