My intention is to extract the reviews of the free tours that appear on these pages:
Guruwalks (https://www.guruwalk.com/es/walks/39405-free-tour-malaga-con-guias-profesionales)
Freetour.com (https://www.freetour.com/es/budapest/free-tour-budapest-imperial)
I'm working with R on Windows, but when using RSelenium it gives me an error.
My initial code is:
#Loading the required packages
library(rvest)
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the website
library(purrr) # for 'map_chr' to get reply
df_0 <- data.frame(tour = character(),
                   dates = character(),
                   names = character(),
                   starts = character(),
                   reviews = character())
url_google <- list("https://www.guruwalk.com/es/walks/39405-free-tour-malaga-con-guias-profesionales")
for (apps in url_google) {
#Specifying the url for desired website to be scraped
url <- apps
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "firefox")
remDr$open()
require(RSelenium)
# go to website
remDr$navigate(url)
The error is:
Error: Summary: SessionNotCreatedException
Detail: A new session could not be created.
Further Details: run errorDetails method
How can I solve it? Thank you
EDIT: From the comments I received so far, I managed to use RSelenium to access the PDF files I am looking for, using the following code:
library(RSelenium)
driver <- rsDriver(browser = "firefox")
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
# It needs some time to load the page
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']")
option$clickElement()
Now, I need R to click the download button, but I could not manage to do so. I tried:
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement()
But I get the following error:
Selenium message: Unable to locate element: //*[@id="download"]
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
class: org.openqa.selenium.NoSuchElementException
Further Details: run errorDetails method
Can someone tell me what is wrong here?
Thanks!
Original question:
I have several webpages from which I need to download embedded PDF files and I am looking for a way to automate it with R. This is one of the webpages: https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398
This is a webpage from CVM (Comissão de Valores Mobiliários, the Brazilian equivalent to the US Securities and Exchange Commission - SEC) to download Notes to Financial Statements (Notas Explicativas) from Brazilian companies.
I tried several options but the website seems to be built in a way that makes it difficult to extract the direct links.
I tried what is suggested here: Downloading all PDFs from URL, but html_nodes(".ms-vb2 a") %>% html_attr("href") yields an empty character vector.
Similarly, when I tried the approach from https://www.samuelworkman.org/blog/scraping-up-bits-of-helpfulness/, html_attr("href") also generates an empty vector.
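Roughly, the attempt looked like this (a minimal sketch assembled from the snippets above; it comes back empty, presumably because the page content is rendered by JavaScript):
# Sketch of the failed rvest attempt (selectors taken from the linked answers)
library(rvest)
page <- read_html("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
pdf_links <- page %>% html_nodes(".ms-vb2 a") %>% html_attr("href")
pdf_links # character(0) -- nothing, since the links are not in the static html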
I am not used to web scraping code in R, so I cannot figure out what is happening.
I appreciate any help!
If someone is facing the same problem I did, I am posting the solution I used:
# set Firefox profile to download PDFs automatically
pdfprof <- makeFirefoxProfile(list(
  "pdfjs.disabled" = TRUE,
  "plugin.scan.plid.all" = FALSE,
  "plugin.scan.Acrobat" = "99.0",
  "browser.helperApps.neverAsk.saveToDisk" = 'application/pdf'))
driver <- rsDriver(browser = "firefox", extraCapabilities = pdfprof)
remote_driver <- driver[["client"]]
remote_driver$navigate("https://www.rad.cvm.gov.br/enetconsulta/frmGerenciaPaginaFRE.aspx?CodigoTipoInstituicao=1&NumeroSequencialDocumento=62398")
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)
option <- remote_driver$findElement(using = 'xpath', "//select[@id='cmbGrupo']/option[@value='PDF|412']") # select the option to open PDF file
option$clickElement()
# Find iframes in the webpage
web.elem <- remote_driver$findElements(using = "css", "iframe") # get all iframes in the webpage
sapply(web.elem, function(x){x$getElementAttribute("id")}) # see their names
remote_driver$switchToFrame(web.elem[[1]]) # Move to the first iframe (Formularios Filho)
web.elem.2 <- remote_driver$findElements(using = "css", "iframe") # get all iframes in the webpage
sapply(web.elem.2, function(x){x$getElementAttribute("id")}) # see their names
# The pdf Viewer iframe is the only one inside Formularios Filho
remote_driver$switchToFrame(web.elem.2[[1]]) # Move to the first iframe (pdf Viewer)
Sys.sleep(3) # It needs some time to load the page (set to 3 seconds)
# Download the PDF file
button <- remote_driver$findElement(using = "xpath", "//*[@id='download']")
button$clickElement() # download
Sys.sleep(3) # Need sometime to finish download and then close the window
remote_driver$close() # Close the window
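As a side note, instead of the fixed Sys.sleep(3) before closing, the script could poll the download folder until Firefox has finished writing the file. A minimal sketch, assuming the default Windows Downloads folder:
# Sketch: wait (up to ~30 s) until Firefox has no partial downloads left
download_dir <- file.path(Sys.getenv("USERPROFILE"), "Downloads") # adjust if your profile saves elsewhere
for (i in 1:30) {
  partial <- list.files(download_dir, pattern = "\\.part$") # Firefox writes .part files while downloading
  if (length(partial) == 0) break
  Sys.sleep(1)
}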
I'm trying to webscrape the job ads from this page: https://con.arbeitsagentur.de/prod/jobboerse/jobsuche-ui/?was=Soziologie%20(grundst%C3%A4ndig)%20(weiterf%C3%BChrend)&wo=&FCT.ANGEBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&page=1&size=50&aktualitaet=100
However, I'm unable to get the information from the individual job ads. I tried it with rvest, xml2 and V8, but I'm a beginner at web scraping and can't manage to solve this problem. It seems that the link doesn't contain the information about the individual job ads, so navigating with XPath doesn't work properly.
Does anyone have an idea how to solve this?
Thanks :)
I have been able to extract the job descriptions with the following code:
library(RSelenium)
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.arbeitsagentur.de/jobsuche/suche?angebotsart=1&was=Soziologie%20(grundst%C3%A4ndig)%20(weiterf%C3%BChrend)&id=10000-1189146489-S")
Sys.sleep(10)
list_Button <- remDr$findElements("class name", "ergebnisliste-item")
Sys.sleep(3)
list_Link_Job_Descriptions <- lapply(X = list_Button, FUN = function(x) x$getElementAttribute("href"))
nb_Links <- length(list_Link_Job_Descriptions)
list_Text_Job_Description <- list()
for (i in 1:nb_Links) {
  print(i)
  remDr$navigate(list_Link_Job_Descriptions[[i]][[1]])
  Sys.sleep(1)
  web_Obj2 <- remDr$findElement("id", "jobdetails-beschreibung")
  list_Text_Job_Description[[i]] <- web_Obj2$getElementText()
}
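If it helps, the collected links and texts can then be flattened into a single data frame (a small sketch; df_jobs is just an illustrative name):
# Combine the links and the extracted descriptions into one data frame
df_jobs <- data.frame(
  link = sapply(list_Link_Job_Descriptions, function(x) x[[1]]),
  description = sapply(list_Text_Job_Description, function(x) x[[1]]),
  stringsAsFactors = FALSE
)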
I am attempting to scrape this website using the rvest package in R. I have done it successfully with several other websites, but this one doesn't seem to work and I am not sure why.
I copied the XPath from inside Chrome's inspector tool, but when I specify it in the rvest script it shows that it doesn't exist. Does it have anything to do with the fact that the table is generated dynamically and not static?
I appreciate the help!
library(rvest)
library(tidyverse)
library(stringr)
library(readr)
a<-read_html("http://www.diversitydatakids.org/data/profile/217/benton-county#ind=10,12,15,17,13,20,19,21,24,2,22,4,34,35,116,117,123,99,100,127,128,129,199,201")
a<-html_node(a, xpath="//*[@id='indicator10']")
a<-html_table(a)
a
Regarding your question: yes, you are unable to get it because the content is being generated dynamically. In these cases, it's better to use the RSelenium library:
#Loading libraries
library(rvest) # to read the html
library(magrittr) # for the '%>%' pipe symbols
library(RSelenium) # to get the loaded html of the website
# starting local RSelenium (this is the only way to start RSelenium that is working for me atm)
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remDr <- remoteDriver(port = 4567L, browserName = "chrome")
remDr$open()
#Specifying the url for desired website to be scraped
url <- "http://www.diversitydatakids.org/data/profile/217/benton-county#ind=10,12,15,17,13,20,19,21,24,2,22,4,34,35,116,117,123,99,100,127,128,129,199,201"
# go to website
remDr$navigate(url)
# get page source and save it as an html object with rvest
html_obj <- remDr$getPageSource(header = TRUE)[[1]] %>% read_html()
# get the element you are looking for
a <- html_node(html_obj, xpath="//*[@id='indicator10']")
I guess that you are trying to get the first table. In that case, maybe it's better to just get the table with html_table:
# get the table with the indicator10 id
indicator10_table <-html_node(html_obj, "#indicator10 table") %>% html_table()
I'm using the CSS selector this time instead of the XPath.
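Once you have what you need, it's worth closing the session. A minimal cleanup sketch (the taskkill line is a Windows-specific workaround, since the server started via shell() has no R handle):
remDr$close() # close the browser session
# the Selenium server was launched with shell(), so there is no R object to stop it;
# on Windows you may need to kill the Java process it runs in:
# shell("taskkill /im java.exe /f")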
Hope it helps! Happy scraping!
I'm trying to automate browsing on a site with RSelenium in order to retrieve the latest planned release dates. My problem is that an age check pops up when I visit the URL. The age-check page consists of two buttons, which I haven't succeeded in clicking through RSelenium. The code that I use thus far is appended below; what is the solution to this problem?
#Variable and URL
s4 <- "https://www.systembolaget.se"
#Start Server
rd <- rsDriver()
remDr <- rd[["client"]]
#Load Page
remDr$navigate(s4)
webE <- remDr$findElements("class name", "action")
webE$isElementEnabled()
webE$clickElement()
You need to target the selector more accurately:
#Variable and URL
s4 <- "https://www.systembolaget.se"
#Start Server
rd <- rsDriver()
remDr <- rd[["client"]]
#Load Page
remDr$navigate(s4)
webE <- remDr$findElement("css", "#modal-agecheck .action.primary")
webE$clickElement()
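If the modal hasn't rendered by the time findElement runs, the call errors out; a small retry sketch (same selector, assumed to appear within ~10 seconds):
# Retry until the age-check button is present, then click it
webE <- NULL
for (i in 1:10) {
  webE <- tryCatch(remDr$findElement("css", "#modal-agecheck .action.primary"),
                   error = function(e) NULL) # findElement errors while the modal is absent
  if (!is.null(webE)) break
  Sys.sleep(1)
}
webE$clickElement()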
For example, I want to scrape the data from this web page (The Space, Amenities, Prices... and the reviews):
https://www.airbnb.com/rooms/9985824?guests=1&s=d2dNfFMd
I want to use the RSelenium package for this purpose.
This is my code:
url <- "https://www.airbnb.com/rooms/9985824?guests=1&s=d2dNfFMd"
library('RSelenium')
pJS <- phantom()
library('XML')
shell.exec(paste0("C:\\Users\\Daniil\\Desktop\\R-language,Python\\file.bat"))
Sys.sleep(10)
checkForServer()
startServer()
remDr <- remoteDriver(browserName="chrome", port=4444)
remDr$open(silent=T)
and then, with the help of SelectorGadget, I found what I think are the right elements for scraping:
var <- remDr$findElements('css selector','#details hr+ .row')
My question is: how can I get the text (character strings) out of it?
Or maybe there is another approach with RSelenium for collecting the data.
Many thanks
I'm not sure what is in file.bat, but it appears you are primarily interested in collecting data about the amenities of the listing. I just used Firefox and skipped over the PhantomJS parts of your code:
url <- "https://www.airbnb.com/rooms/9985824?guests=1&s=d2dNfFMd"
library('RSelenium')
checkForServer()
startServer()
remDr <- remoteDriver(browserName="firefox", port=4444)
remDr$open(silent=T)
remDr$navigate(url)
var <- remDr$findElement('css selector','#details hr+ .row')
print(var$getElementText())
[[1]]
[1] "The Space\nAccommodates: 2\nBathrooms: 1.5\nBed type: Real Bed\nBedrooms: 1\nBeds: 1\nProperty type: Apartment\nRoom type: Private room\nHouse Rules"
From here you can parse the string or perform additional data collecting.
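For instance, a rough sketch of turning that newline-separated string into a named character vector (splitting each row on ": "):
# Parse "key: value" rows out of the element text
txt <- var$getElementText()[[1]]
rows <- strsplit(txt, "\n")[[1]]
kv <- strsplit(rows, ": ", fixed = TRUE)
amenities <- sapply(kv, function(x) if (length(x) == 2) x[2] else NA_character_)
names(amenities) <- sapply(kv, `[`, 1)
amenities["Bedrooms"] # e.g. "1"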