For example, I want to scrape the data (The Space, Amenities, Prices... and reviews) from this web page:
https://www.airbnb.com/rooms/9985824?guests=1&s=d2dNfFMd
I want to use the RSelenium package for this purpose.
This is my code:
url <- "https://www.airbnb.com/rooms/9985824?guests=1&s=d2dNfFMd"
library('RSelenium')
pJS <- phantom()
library('XML')
shell.exec(paste0("C:\\Users\\Daniil\\Desktop\\R-language,Python\\file.bat"))
Sys.sleep(10)
checkForServer()
startServer()
remDr <- remoteDriver(browserName="chrome", port=4444)
remDr$open(silent=T)
Then, with the help of SelectorGadget, I found what I think are the right elements for scraping:
var <- remDr$findElements('css selector','#details hr+ .row')
My question is: how can I get this into text (character strings)?
Or maybe there is another approach for collecting data with RSelenium.
Many thanks.
I'm not sure what is in file.bat, but it appears you are primarily interested in collecting data about the amenities of the listing. I just used Firefox and skipped over the PhantomJS parts of your code:
url <- "https://www.airbnb.com/rooms/9985824?guests=1&s=d2dNfFMd"
library('RSelenium')
checkForServer()
startServer()
remDr <- remoteDriver(browserName="firefox", port=4444)
remDr$open(silent=T)
remDr$navigate(url)
var <- remDr$findElement('css selector','#details hr+ .row')
print(var$getElementText())
[[1]]
[1] "The Space\nAccommodates: 2\nBathrooms: 1.5\nBed type: Real Bed\nBedrooms: 1\nBeds: 1\nProperty type: Apartment\nRoom type: Private room\nHouse Rules"
From here you can parse the string or perform additional data collection.
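For example, a minimal sketch of splitting that text into named fields (this assumes the "key: value" lines shown in the output above):
# split the raw element text on newlines and keep the "key: value" lines
txt <- var$getElementText()[[1]]
txt_lines <- strsplit(txt, "\n", fixed = TRUE)[[1]]
kv <- strsplit(txt_lines[grepl(": ", txt_lines, fixed = TRUE)], ": ", fixed = TRUE)
details <- setNames(vapply(kv, `[`, character(1), 2),
                    vapply(kv, `[`, character(1), 1))
details["Accommodates"]
# Accommodates
#          "2"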
Related
I'm running a loop to scrape a massive amount of data using RSelenium. If the loop breaks, I'd like to see the element and URL where RSelenium left off.
Is there a way to print out the element the link is in and the URL as each page is completed?
Using the code below prints [[1]] [1] "" and that's it.
# check completed links
complete <- rd$findElements(using = "tag name", "a")
for (url in seq_along(complete)) {
  done <- complete[[url]]
  print(done$getElementText())
}
You can use getCurrentUrl() instead of getElementText():
library(RSelenium)
driver <- rsDriver(browser = "firefox")
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.r-project.org/")
remote_driver$getCurrentUrl()
[[1]]
[1] "https://www.r-project.org/"
My intention is to extract the reviews of the free tours that appear on these pages:
Guruwalks (https://www.guruwalk.com/es/walks/39405-free-tour-malaga-con-guias-profesionales)
Freetour.com (https://www.freetour.com/es/budapest/free-tour-budapest-imperial)
I'm working with R on Windows, but RSelenium gives me an error.
My initial code is:
# Load the rvest package
library(rvest)
library(magrittr)   # for the '%>%' pipe operator
library(RSelenium)  # to get the fully loaded html of the page
library(purrr)      # for 'map_chr' to get reply
df_0 <- data.frame(tour = character(),
                   dates = character(),
                   names = character(),
                   starts = character(),
                   reviews = character())
url_google <- list("https://www.guruwalk.com/es/walks/39405-free-tour-malaga-con-guias-profesionales")
for (apps in url_google) {
  # specify the url of the website to be scraped
  url <- apps
  # start local RSelenium (this is the only way to start RSelenium that works for me at the moment)
  selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
  shell(selCommand, wait = FALSE, minimized = TRUE)
  remDr <- remoteDriver(port = 4567L, browserName = "firefox")
  remDr$open()
  # go to the website
  remDr$navigate(url)
The error is:
Error: Summary: SessionNotCreatedException
Detail: A new session could not be created.
Further Details: run errorDetails method
How can I solve it? Thank you
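Not a definitive fix, but SessionNotCreatedException usually means the browser and its driver versions don't match (or the port is already in use). One common workaround, sketched below, is to let rsDriver() manage the server and driver binaries itself, on a fresh port:
library(RSelenium)
# rsDriver() downloads matching driver binaries and starts the server itself
rD <- rsDriver(browser = "firefox", port = 4568L, verbose = FALSE)
remDr <- rD[["client"]]
remDr$navigate("https://www.guruwalk.com/es/walks/39405-free-tour-malaga-con-guias-profesionales")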
I have started using the rvest package and have encountered some consistent problems, namely exactly how to refer to the HTML code.
For example, the code below returns a null character (I ultimately want 0.74). Basically, the only thing I can get to return anything is using "div" as the node, which just returns all the text. "tr.total-return", "total-return", and "div.sal-trailing-return__middle" all returned null too.
a <- read_html("https://www.morningstar.com/funds/xnas/hcyix/performance")
b <- html_nodes(a, "td")
That page loads dynamically. You thus need to use RSelenium, and not just rvest.
This code works for me to obtain the data point of 0.74.
library(rvest)
library(tidyverse)
library(RSelenium)
url <- "https://www.morningstar.com/funds/xnas/hcyix/performance"
# RSelenium with Firefox
rD <- RSelenium::rsDriver(browser = "firefox", port = 4546L, verbose = FALSE)
remDr <- rD[["client"]]
remDr$navigate(url)
Sys.sleep(4)
# get the page source
web <- remDr$getPageSource()
web <- xml2::read_html(web[[1]])
b <- html_node(web, ".total-return > td:nth-child(1)") %>%
  html_text() %>%
  trimws()
# close RSelenium
remDr$close()
gc()
rD$server$stop()
system("taskkill /im java.exe /f", intern = FALSE, ignore.stdout = FALSE)
I tried using rvest to extract the links of "VAI ALLA SCHEDA PRODOTTO" from this website:
https://www.asusworld.it/series.asp?m=Notebook#db_p=2
My R code:
library(rvest)
page.source <- read_html("https://www.asusworld.it/series.asp?m=Notebook#db_p=2")
version.block <- html_nodes(page.source, "a") %>% html_attr("href")
However, I can't get any links that look like "/model.asp?p=2340487". How can I do this?
You may utilize RSelenium to request the intended information from the website.
Load the relevant packages. (Please ensure that the R package 'wdman' is up-to-date.)
library("RSelenium")
library("wdman")
Initialize the R Selenium server (I use Firefox - recommended).
rD <- rsDriver(browser = "firefox", port = 4850L)
rd <- rD$client
Navigate to the URL (and set an appropriate waiting time).
rd$navigate("https://www.asusworld.it/series.asp?m=Notebook#db_p=2")
Sys.sleep(5)
Request the intended information (you may refer to, for example, the 'xpath' of the element).
element <- rd$findElement(using = 'xpath', "//*[@id='series']/div[2]/div[2]/div/div/div[2]/table/tbody/tr/td/div/a/div[2]")
Display the requested element (i.e., information).
element$getElementText()
[[1]]
[1] "VAI ALLA SCHEDA PRODOTTO"
Hopefully, this helps.
I'm trying to automate browsing on a site with RSelenium in order to retrieve the latest planned release dates. My problem is that an age check pops up when I visit the URL. The age-check page consists of two buttons, which I haven't succeeded in clicking through RSelenium. The code I have used so far is appended below; what is the solution to this problem?
# Variable and URL
s4 <- "https://www.systembolaget.se"
# Start server
rd <- rsDriver()
remDr <- rd[["client"]]
# Load page
remDr$navigate(s4)
webE <- remDr$findElements("class name", "action")
webE$isElementEnabled()
webE$clickElement()
You need to target the selector more accurately:
# Variable and URL
s4 <- "https://www.systembolaget.se"
# Start server
rd <- rsDriver()
remDr <- rd[["client"]]
# Load page
remDr$navigate(s4)
# target the primary action button inside the age-check modal
webE <- remDr$findElement("css", "#modal-agecheck .action.primary")
webE$clickElement()
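If the modal has not rendered by the time findElement() runs, the call will error; a small polling loop before clicking is one way around that (a sketch):
# poll for up to ~10 seconds until the age-check button exists, then click it
found <- FALSE
for (i in 1:20) {
  found <- tryCatch({
    webE <- remDr$findElement("css", "#modal-agecheck .action.primary")
    TRUE
  }, error = function(e) FALSE)
  if (found) break
  Sys.sleep(0.5)
}
if (found) webE$clickElement()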