Web scraping - unable to get the full content of the page with R

I'm trying to webscrape the job ads from this page: https://con.arbeitsagentur.de/prod/jobboerse/jobsuche-ui/?was=Soziologie%20(grundst%C3%A4ndig)%20(weiterf%C3%BChrend)&wo=&FCT.ANGEBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&page=1&size=50&aktualitaet=100
However, I'm unable to get the information from the individual job ads. I tried rvest, xml2 and V8, but I'm a beginner at web scraping and can't solve this problem. It seems the page source doesn't contain the information about the individual job ads, so navigating with XPath doesn't work properly.
Does anyone have an idea how to solve this?
Thanks :)
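A quick check confirms this: the static HTML really does lack the ad list (a minimal sketch, assuming rvest >= 1.0; the .ergebnisliste-item class is taken from the answer below):
library(rvest)
# fetch the raw HTML of the search page; the ads are injected later by JavaScript
page <- read_html("https://con.arbeitsagentur.de/prod/jobboerse/jobsuche-ui/?was=Soziologie%20(grundst%C3%A4ndig)%20(weiterf%C3%BChrend)&wo=&FCT.ANGEBOTSART=ARBEIT&FCT.BEHINDERUNG=AUS&page=1&size=50&aktualitaet=100")
# returns an empty node set, which is why rvest/xml2 alone find nothing
html_elements(page, ".ergebnisliste-item")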

I have been able to extract the job descriptions with the following code:
library(RSelenium)
# start a standalone Firefox container for Selenium, exposed on port 4445
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.arbeitsagentur.de/jobsuche/suche?angebotsart=1&was=Soziologie%20(grundst%C3%A4ndig)%20(weiterf%C3%BChrend)&id=10000-1189146489-S")
Sys.sleep(10)  # give the JavaScript-rendered result list time to load
# collect the links to the individual job ads
list_Button <- remDr$findElements("class name", "ergebnisliste-item")
Sys.sleep(3)
list_Link_Job_Descriptions <- lapply(list_Button, function(x) x$getElementAttribute("href"))
nb_Links <- length(list_Link_Job_Descriptions)
# visit each ad and store its description text
list_Text_Job_Description <- list()
for (i in seq_len(nb_Links)) {
  print(i)
  remDr$navigate(list_Link_Job_Descriptions[[i]][[1]])
  Sys.sleep(1)
  web_Obj2 <- remDr$findElement("id", "jobdetails-beschreibung")
  list_Text_Job_Description[[i]] <- web_Obj2$getElementText()
}
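To work with the collected texts afterwards, the list can be flattened (a small usage sketch; getElementText() returns a one-element list per ad):
# flatten into a character vector, one entry per job ad
descriptions <- vapply(list_Text_Job_Description, function(x) x[[1]], character(1))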

Related

Extract reviews from Free Tours websites

My intention is to extract the reviews of the free tours that appear on these pages:
Guruwalks (https://www.guruwalk.com/es/walks/39405-free-tour-malaga-con-guias-profesionales)
Freetour.com (https://www.freetour.com/es/budapest/free-tour-budapest-imperial)
I'm working with R on Windows, but RSelenium gives me an error.
My initial code is:
# load the packages
library(rvest)
library(magrittr)  # for the '%>%' pipe
library(RSelenium) # to get the rendered HTML of the page
library(purrr)     # for 'map_chr' to extract the replies
df_0 <- data.frame(tour = character(),
                   dates = character(),
                   names = character(),
                   starts = character(),
                   reviews = character())
url_google <- list("https://www.guruwalk.com/es/walks/39405-free-tour-malaga-con-guias-profesionales")
for (apps in url_google) {
  # the URL of the website to be scraped
  url <- apps
  # start a local Selenium server (the only way of starting RSelenium that works for me at the moment)
  selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
  shell(selCommand, wait = FALSE, minimized = TRUE)
  remDr <- remoteDriver(port = 4567L, browserName = "firefox")
  remDr$open()
  # go to the website
  remDr$navigate(url)
}
The error is:
Error: Summary: SessionNotCreatedException
Detail: A new session could not be created.
Further Details: run errorDetails method
How can I solve it? Thank you
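One thing worth trying (an assumption, not a verified fix for this site): SessionNotCreatedException often comes from a stale Selenium server already bound to the port, or from a browser/driver version mismatch. Starting a fresh client/server pair with rsDriver() on an unused port sidesteps the manual wdman + remoteDriver setup:
library(RSelenium)
# start server and Firefox client together; port 4568 is assumed to be free,
# and chromever = NULL skips downloading a Chrome driver
rd <- rsDriver(browser = "firefox", port = 4568L, chromever = NULL)
remDr <- rd$client
remDr$navigate("https://www.guruwalk.com/es/walks/39405-free-tour-malaga-con-guias-profesionales")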

Why can’t RSelenium press this button?

I'm trying to automate browsing on a site with RSelenium in order to retrieve the latest planned release dates. My problem is that an age check pops up when I visit the URL. The age-check page consists of two buttons, neither of which I have managed to click through RSelenium. The code I have used so far is appended below; what is the solution to this problem?
#Variable and URL
s4 <- "https://www.systembolaget.se"
#Start Server
rd <- rsDriver()
remDr <- rd[["client"]]
#Load Page
remDr$navigate(s4)
webE <- remDr$findElements("class name", "action")
webE$isElementEnabled()
webE$clickElement()
You need to target the selector more accurately:
#Variable and URL
s4 <- "https://www.systembolaget.se"
#Start Server
rd <- rsDriver()
remDr <- rd[["client"]]
#Load Page
remDr$navigate(s4)
webE <- remDr$findElement("css", "#modal-agecheck .action.primary")
webE$clickElement()
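Note that the original attempt also used findElements(), which returns a list of elements, so methods like isElementEnabled() cannot be called on it directly. If the age-check modal takes a moment to appear, a short poll before clicking can help (a generic sketch, not specific to this site):
# poll for the age-check button for up to 10 seconds before clicking;
# webE stays NULL if the button never appears
webE <- NULL
for (i in 1:20) {
  webE <- tryCatch(
    remDr$findElement("css", "#modal-agecheck .action.primary"),
    error = function(e) NULL
  )
  if (!is.null(webE)) break
  Sys.sleep(0.5)
}
webE$clickElement()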

Webscrape w/ Rselenium and Rvest from dropdown box where id changes

I am looking to scrape some NBA data from the website numberfire at: https://www.numberfire.com/nba/daily-fantasy/daily-basketball-projections
I am trying to use a drop-down box to switch the displayed data from FanDuel to DraftKings. The first problem is that the web page does not change when that pull-down menu changes, so I installed and am successfully running Selenium to work around it. The next problem is that the id of this pull-down menu (and the id of every pull-down menu on this site) changes with each refresh. This causes a "NoSuchElement" error in R, because the code cannot lock on to the proper menu box when it reaches the page.
Is there a way to fix this with RSelenium or another package?
Here is my code in R:
require(RSelenium)
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", port = 4445, browserName = "chrome")
remDr$open()
remDr$navigate("https://www.numberfire.com/nba/daily-fantasy/daily-basketball-projections")
iframe <- remDr$findElement(using = 'id', value = "select2-dy8e-container")
remDr$switchToFrame(iframe)
option <- remDr$findElement(using = 'xpath', "//*/option[@value = 'DraftKings']")
option$clickElement()
option
Update: after doing a lot of searching on non-static ids, I came up with this and it worked:
remDr <- remoteDriver(remoteServerAddr = "192.168.99.100", port = 4445, browserName = "chrome")
remDr$open()
remDr$navigate("https://www.numberfire.com/nba/daily-fantasy/daily-basketball-projections")
webElem <- remDr$findElement('xpath', '//*[@class = "dropdown-custom dfs-option select2-hidden-accessible"]/option[@value = "4"]')
webElem$clickElement()
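The XPath works because it anchors on the drop-down's stable class attribute instead of the select2-generated id, which changes on every refresh. The same idea works when you only know the option's visible label rather than its value attribute (a sketch; the label "DraftKings" is assumed):
# pick the option by its visible text instead of its value attribute
webElem <- remDr$findElement(
  'xpath',
  '//select[contains(@class, "dfs-option")]/option[text() = "DraftKings"]'
)
webElem$clickElement()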

Retrieve data from a web page table using RSelenium

I am trying to scrape the annual maximum flow data from this National River Flow Archive (UK) website:
http://nrfa.ceh.ac.uk/data/station/info/69032
using RSelenium.
I can't find a way to negotiate the drop-down menu. At present I can semi-automate the process using:
library(RSelenium)
library(XML)  # provides htmlParse() and readHTMLTable()
checkForServer()
startServer()
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444, browserName = "firefox", platform = "LINUX")
remDr$open()
i <- "69032"
remDr$navigate(paste0("http://nrfa.ceh.ac.uk/data/station/peakflow/", i))
# read the raw html and parse
doc <- htmlParse(remDr$getPageSource()[[1]])
peak.flows <- as.numeric(readHTMLTable(doc)$tablesorter[, "Flow (m3/s)"])
This is a bit of a hack and involves me having to click a few buttons on the page rather than getting RSelenium to do it. Any suggestions as to how RSelenium can select the "Peak flow data" tab and then the "Maximum Annual (AMAX) data" option from the drop-down menu?
library(RSelenium)
checkForServer()
startServer()
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444, browserName = "firefox", platform = "LINUX")
remDr$open()
i <- "69032"
remDr$navigate(paste0("http://nrfa.ceh.ac.uk/data/station/peakflow/", i))
# click the "Peak flow data" tab
remDr$findElement(using = "css selector", '.selected a')$clickElement()
Sys.sleep(5)
# open the data-type drop-down and choose the AMAX option
remDr$findElement(using = "css selector", "#selectDataType")$clickElement()
remDr$findElement(using = "css selector", "#selectDataType")$sendKeysToElement(list(key = "down_arrow", key = "enter"))
Sys.sleep(2)
If you want to find the css id of the element of interest, install the SelectorGadget plugin in Chrome, highlight the element you want RSelenium to click, then grab the css id.
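Once the AMAX view has loaded, the parsing step from the question applies unchanged (assuming the XML package is loaded):
# re-read the rendered page and extract the flow column
doc <- htmlParse(remDr$getPageSource()[[1]])
peak.flows <- as.numeric(readHTMLTable(doc)$tablesorter[, "Flow (m3/s)"])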

Taking window value using RSelenium

Using RSelenium to open a site like this:
require(RSelenium)
RSelenium::startServer()
remDr <- remoteDriver(browserName = "chrome")
remDr$open()
remDr$navigate("http://www.adobe.com/") #the site is just an example
What command should I use to retrieve the value of window.s_adobe in R?
You could try something along the lines of:
res <- remDr$executeScript('return window.screenX;')
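The same pattern applies to the variable from the question (a sketch; if window.s_adobe is a plain JavaScript object, it comes back as a nested R list):
res <- remDr$executeScript('return window.s_adobe;')
str(res)  # inspect the structure of whatever the page exposes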
