I'm trying to scrape this site but I can't get any results. My code is:
library(RSelenium)
url <- "https://derbis.dernekler.gov.tr/IstatistikDerneklerWeb/IlFaaliyetAlaniDernekler"
driver <- rsDriver(browser=c("firefox"), port = 4445L)
remote_driver <- driver[["client"]]
remote_driver$navigate(url)
option <- remote_driver$findElement(using = 'xpath',
                                    "//select[@id='cbIl']/option[@value='ADANA']")
option$clickElement()
My aim is to get the table by clicking the button after selecting the inputs I want. I plan to wrap this in a loop to try all the combinations and collect each table, but I am stuck at the step above. Can you show me an example?
Any help would be much appreciated.
Looks like the option value is 1, not 'ADANA'.
Rather, it seems to be @value="1".
If you instead mean to look for the option whose text content is 'ADANA', that question has been asked and answered before, for example here:
Xpath: how to select an option based on its text not value property?
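Applied to your page, a minimal sketch of both variants (assuming the select keeps the id cbIl and the Adana option carries value "1"):
# select the option by its value attribute
option <- remote_driver$findElement(using = 'xpath',
                                    "//select[@id='cbIl']/option[@value='1']")
option$clickElement()
# or select it by its visible text instead
option <- remote_driver$findElement(using = 'xpath',
                                    "//select[@id='cbIl']/option[text()='ADANA']")
option$clickElement()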
I'm trying to write code for web scraping in R when you have to enter inputs.
Specifically, I have a platform where I need to complete 2 fields, then click submit and get the results.
But I don't know how to use my columns in R as inputs on the platform.
I searched for an example but I didn't find any.
Please, if anyone can give me a simple example.
Thank you
EDIT:
I don't have any code yet. I was looking for an example where you use input to complete a field on a site and then scrape the result.
In the photo are the fields on my URL. In R I have a dataframe with 2 columns, one for CNP/CUI and one for VIN/SASIU, with 100 rows or more. I want to use these columns as inputs and capture the output for every row.
EDIT2:
The example provided by @Dominik S. Meier worked for me when I had a list of inputs. For column inputs I will post another question.
But until then, I want to mention a few things that helped me; maybe they will help somebody else.
You need to be sure that all the versions match: R version, browser version, browser driver version, Java version. For me the chromedriver version didn't match, even though I downloaded the right version. The problem was that I had 3 Chrome versions installed and I think it didn't pick the right one. I fixed it with: rD <- rsDriver(browser = c("chrome"), port = 4444L, chromever = "83.0.4103.39").
Because one element didn't have an id, as in e.g. webElem <- remDr$findElement(using = "id", "trimite"), I used a CSS selector instead. You can find the CSS selector with right click -> Copy -> Copy selector (in the HTML code on the page).
If you don't get the results, maybe you aren't using the right selector. I did that and the result was list(). Then I tried CSS selectors from higher up in the HTML code. I don't know if it is the right solution, but for me it worked; see the sketch below.
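For reference, a minimal sketch of that fallback (the selector string is hypothetical, the kind you get from Copy selector):
# hypothetical selector copied via right click -> Copy -> Copy selector
webElem <- remDr$findElement(using = "css selector",
                             "#form1 > div.content > input.submit")
webElem$clickElement()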
Hope it will help. Thank you.
Using RSelenium (see here for more info):
library(RSelenium)
rD <- rsDriver(browser = c("firefox")) # specify browser type you want Selenium to open
remDr <- rD$client
remDr$navigate("https://pro.rarom.ro/istoric_vehicul/dosar_vehicul.aspx") # navigates to webpage
# select first input field
option <- remDr$findElement(using = 'id', value = "inputEmail")
option$highlightElement()
option$clickElement()
option$sendKeysToElement(list("email@email.com"))
# select second input field
option <- remDr$findElement(using = 'id', value = "inputEmail2")
option$highlightElement()
option$clickElement()
option$sendKeysToElement(list("email@email.com"))
# select third input field
option <- remDr$findElement(using = 'id', value = "inputVIN")
option$highlightElement()
option$clickElement()
option$sendKeysToElement(list("123"))
# click the submit button
webElem <- remDr$findElement(using = "id", "trimite")
webElem$highlightElement()
webElem$clickElement()
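To run this for a whole dataframe of inputs, as the question asks, a minimal sketch that loops over the rows; the inputs dataframe, its column names, and the field-to-column mapping are assumptions:
# hypothetical dataframe holding the two input columns from the question
inputs <- data.frame(cnp = c("1234567890123", "2345678901234"),
                     vin = c("VIN0001", "VIN0002"),
                     stringsAsFactors = FALSE)
for (i in seq_len(nrow(inputs))) {
  elem <- remDr$findElement(using = 'id', value = "inputEmail") # assumed CNP/CUI field
  elem$clearElement()
  elem$sendKeysToElement(list(inputs$cnp[i]))
  elem <- remDr$findElement(using = 'id', value = "inputVIN") # VIN/SASIU field
  elem$clearElement()
  elem$sendKeysToElement(list(inputs$vin[i]))
  remDr$findElement(using = "id", "trimite")$clickElement() # submit
  Sys.sleep(2) # give the page time to render the result before scraping it
}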
I am rather new to web scraping but need data for my PhD project. For this, I am extracting data on different activities of MEPs from the European Parliament's website. Concretely, and where I have problems, I would like to extract the title and, especially, the link underlying the title of each speech from an MEP's personal page. I use code that has already worked fine several times, but here I do not succeed in getting the link, only the title of the speech. For the links I get the error message "subscript out of bounds". I am working with RSelenium because there are several "load more" buttons on the individual pages I have to click first before extracting the data (which makes rvest a complicated option as far as I can see).
I have basically been trying to solve this for days now, and I really do not know how to get further. I have the impression that the CSS selector is not actually capturing the underlying link (as it extracts the title without problems), but the class has a compound name ("ep-a_heading ep-layout_level2"), so it is not possible to go that way either. I tried rvest as well (ignoring the problem I would then have with the "load more" button) but I still do not get to those links.
library(RSelenium)
library(wdman)
library(rvest, warn.conflicts=FALSE)
library(stringr)
server <- phantomjs(port=7005L)
browser <- remoteDriver(browserName = "phantomjs", port=7005L)
## this is one of the urls I will use; there are others, constructed all
## the same way and all with the same problem
url <- 'http://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8'
browser$open()
browser$navigate(url)
## now I identify the load more button and click on it as long as there
## is a "load more" button on the page
more <- browser$findElement(using = "css",
                            value = ".erpl-activities-loadmore-button .ep_name")
while (!is.null(more)) {
  more$clickElement()
  Sys.sleep(1)
}
## I get an error message doing this in the end, but it is working anyway
## (yes, I really am a beginner!)
## Now, what I want to extract are the title of the speech and, most
## importantly, the URL.
links <- browser$findElements(using="css", ".ep-layout_level2 .ep_title")
length(links)
## there are 128 Speeches listed on the page
URL <- rep(NA, length(links))
Title <- rep(NA, length(links))
## after having created vectors to store the results, I apply the loop
## function that has already worked fine many times to extract the data I
## want
for (i in 1:length(links)) {
  URL[i] <- links[[i]]$getElementAttribute('href')[[1]]
  Title[i] <- links[[i]]$getElementText()[[1]]
}
speeches <- data.frame(Title, URL)
For this example there are 128 speeches on the page, so in the end I would need a table with 128 titles and links. The code works fine when I only try for the title, but for the URLs I get:
`"Error in links[[i]]$getElementAttribute("href")[[1]] : subscript out of bounds"`
Thank you very much for your help, I already read many posts on subscript out of bounds issues in this forum, but unfortunately I still couldn't solve the problem.
Have a great day!
I don't seem to have a problem using rvest to get that info, so there is no need for the overhead of Selenium. You want to target the a tag child of that class, i.e. .ep-layout_level2 a, in order to be able to access an href attribute. The same selector would apply for Selenium.
library(rvest)
library(magrittr)
page <- read_html('https://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8')
titles <- page %>% html_nodes('.ep-layout_level2 .ep_title') %>% html_text() %>% gsub("\\r\\n\\t+", "", .)
links <- page %>% html_nodes('.ep-layout_level2 a') %>% html_attr(., "href")
results <- data.frame(titles,links)
Here you have a working solution based on the code you provided:
library(RSelenium)
library(wdman)
library(rvest, warn.conflicts=FALSE)
library(stringr)
server <- phantomjs(port=7005L)
browser <- remoteDriver(browserName = "phantomjs", port=7005L)
## this is one of the urls I will use; there are others, constructed all
## the same way and all with the same problem
url <- 'http://www.europarl.europa.eu/meps/en/124936/MARIA_ARENA/all-activities/plenary-speeches/8'
browser$open()
browser$navigate(url)
## now I identify the load more button and click on it as long as there
## is a "load more" button on the page
more <- browser$findElement(using = "class", value = "erpl-activity-loadmore-button")
while (grepl("erpl-activity-loadmore-button", browser$getPageSource()[[1]], fixed = TRUE)) {
  more$clickElement()
  Sys.sleep(1)
}
## I get an error message doing this in the end, but it is working anyway
## (yes, I really am a beginner!)
## Now, what I want to extract are the title of the speech and, most
## importantly, the URL.
links <- browser$findElements(using="class", "ep-layout_level2")
## there are 128 Speeches listed on the page
URL <- rep(NA, length(links))
Title <- rep(NA, length(links))
## after having created vectors to store the results, I apply the loop
## function that has already worked fine many times to extract the data I
## want
for (i in 1:length(links)) {
  l <- links[[i]]$findChildElement(using = "css", "a")
  URL[i] <- l$getElementAttribute('href')[[1]]
  Title[i] <- links[[i]]$getElementText()[[1]]
}
speeches <- data.frame(Title, URL)
speeches
The main differences are:
In the first findElement I use value = "erpl-activity-loadmore-button". Indeed, the documentation says that you cannot look for multiple class values at once.
The same applies when looking for the links.
In the final loop, you first need to select the link element inside the div you selected, and then read its href attribute.
To answer your question about the error message in the comments after the while loop: once you have pressed the "Load more" button enough times, it becomes invisible, but it still exists. So when you check !is.null(more) it is TRUE because the button still exists, but when you try to click it you get an error message because it is invisible. You can fix this by checking whether it is visible or not.
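A minimal sketch of that check, using the webElement method isElementDisplayed():
more <- browser$findElement(using = "class", value = "erpl-activity-loadmore-button")
while (unlist(more$isElementDisplayed())) { # FALSE once the button turns invisible
  more$clickElement()
  Sys.sleep(1)
}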
I would like to use RSelenium to download files (by clicking on the Excel image) from this website http://highereducationstatistics.education.gov.au/. However, before downloading the file, a series of drag-and-drop actions (see this image http://highereducationstatistics.education.gov.au/images/preDragDimension.png) have to be performed so the right dataset can be chosen (see http://highereducationstatistics.education.gov.au/GettingStarted.aspx for instructions).
I am wondering whether RSelenium has this type of drag-and-drop functionality. I have searched this whole day and guess that mouseMoveToLocation combined with other functions like buttondown might be the answer, but I have no idea how to use them. Can anyone help with this?
Thanks very much.
First navigate with RSelenium to the page using:
library(RSelenium)
rD <- rsDriver() # runs a chrome browser, wait for necessary files to download
remDr <- rD$client
remDr$navigate("http://highereducationstatistics.education.gov.au/")
Then locate the element you want to drag to the chart/grid. In this example I will be selecting the Course Level from the left menu.
webElem1 <- remDr$findElement('xpath', "//span[@title = 'Course Level']")
Select the element where you want to drop this menu item. In this case the element has an id = "olapClientGridXrowHeader":
webElem2 <- remDr$findElement(using = 'id', "olapClientGridXrowHeader")
Once both items are selected, drag the first one into the second one like this:
remDr$mouseMoveToLocation(webElement = webElem1)
remDr$buttondown()
remDr$mouseMoveToLocation(webElement = webElem2)
remDr$buttonup()
Notice that these methods work on the remote driver, not the elements.
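If you need the gesture more than once (the Getting Started guide drags several dimensions onto the grid), it can be wrapped in a small helper; a sketch using only the methods shown above:
# drag one web element onto another via the remote driver's mouse methods
drag_and_drop <- function(remDr, from, to) {
  remDr$mouseMoveToLocation(webElement = from)
  remDr$buttondown()
  remDr$mouseMoveToLocation(webElement = to)
  remDr$buttonup()
}
drag_and_drop(remDr, webElem1, webElem2)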
I am trying to scrape a textbox value from the URL in the code. I picked the CSS using Selector Gadget, but it is not able to capture the content of the text box. I tested several other CSS selectors too, but the textbox value is not captured.
The text box is: construction year.
Please help. Below is the code for reference.
library(RSelenium)
library(rvest)
# remDr is assumed to be an open RSelenium remoteDriver session
url = "https://www.ncspo.com/FIS/dbBldgAsset_public.aspx?BldgAssetID=8848"
values = list()
remDr$navigate(url)
page_source <- remDr$getPageSource()
a = read_html(page_source[[1]])
html_main_node = html_nodes(a, "#ctl00_mainContentPlaceholder_txtConstructionYear_iu")
values = html_text(html_main_node)
values
Thanks in advance
Why RSelenium? It scrapes fine with rvest (though it is a horrible SharePoint site, which may cause problems down the line with maintaining the proper view-state cookies).
library(rvest)
pg <- html_session("https://www.ncspo.com/FIS/dbBldgAsset_public.aspx?BldgAssetID=8848")
html_attr(html_nodes(pg, "input#ctl00_mainContentPlaceholder_txtConstructionYear_iu"), "value")
## [1] 1987
You should be grabbing the value attribute vs. the node text. This should work in your selenium code, too.
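For instance, a minimal sketch of the Selenium variant (assuming remDr is an open RSelenium session navigated to that page):
elem <- remDr$findElement(using = "css selector",
                          "#ctl00_mainContentPlaceholder_txtConstructionYear_iu")
elem$getElementAttribute("value")[[1]] # read the value attribute, not the node text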
The above answer also works, but if you are only trying to use RSelenium, here is the code:
library(RSelenium)
checkForServer() # older RSelenium API: download the Selenium server binary if missing
startServer()    # and start it
Sys.sleep(5)
re <- remoteDriver()
re$open()
re$navigate("https://www.ncspo.com/FIS/dbBldgAsset_public.aspx?BldgAssetID=8848")
re$findElement(using = "css selector", "#ctl00_mainContentPlaceholder_txtConstructionYear_iu")$clickElement()
text <- unlist(re$findElement(using = "css selector", "#ctl00_mainContentPlaceholder_txtConstructionYear_iu")$getElementAttribute("value"))
This works
As with the beginning of any problem before I post it on Stack Overflow, I think I have tried everything. This is a learning experience for me in working with JavaScript and XML, so I'm guessing my problem is there.
My question is: how do I get the results of clicking on the parcel number links, which are JavaScript links? I've tried getting the XPath of the link and using the $click method, following my intuition, but this wasn't right, or at least it is not working for me.
Firefox 26.0
R 3.0.2
require(relenium)
library(XML)
library(stringr)
initializing_parcel_number <- "00000000000"
firefox <- firefoxClass$new()
firefox$get("http://www.muni.org/pw/public.html")
inputElement <- firefox$findElementByXPath("/html/body/form[2]/table/tbody/tr[2]/td/table[1]/tbody/tr[3]/td[4]/input[1]")
inputElement$sendKeys(initializing_parcel_number)
inputElement$sendKeys(key = "ENTER")
##xpath to the first link. Or is it?
first_link <- "/html/body/table/tbody/tr[2]/td/table[5]/tbody/tr[2]/td[1]/a"
##How I'm trying to click the thing.
linkElement <- firefox$findElementByXPath("/html/body/table/tbody/tr[2]/td/table[5]/tbody/tr[2]/td[1]/a")
linkElement$click()
You can do this using RSelenium. See http://johndharrison.github.io/RSelenium/. DISCLAIMER: I am the author of the RSelenium package. A basic vignette on operation can be viewed at RSelenium basics and
RSelenium: Testing Shiny apps.
If you are unsure of what element is selected you can use the highlightElement utility method in the webElement class see the commented out code.
The element click event won't work in this case. You need to simulate a click using JavaScript:
require(RSelenium)
# RSelenium::startServer # if needed
initializing_parcel_number <- "00000000000"
remDr <- remoteDriver()
remDr$open()
remDr$navigate("http://www.muni.org/pw/public.html")
webElem <- remDr$findElement(using = "name", "PAR1")
# webElem$highlightElement() # to visually check what element is selected
webElem$sendKeysToElement(list(initializing_parcel_number, key = "enter"))
# get first link containing javascript:getParcel
webElem <- remDr$findElement(using = "css selector", '[href*="javascript:getParcel"]')
# webElem$highlightElement() # to visually check what element is selected
# send a webElement as an argument.
remDr$executeScript("arguments[0].click();", list(webElem))
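Once the simulated click has fired, the resulting page can be pulled back into R for parsing; a minimal sketch (htmlParse from the XML package loaded in the question is one option):
# after the simulated click, grab the new page source from the driver
result_source <- remDr$getPageSource()[[1]]
doc <- XML::htmlParse(result_source) # parse with the XML package, as in the question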