blank value captures while scraping using Rselenium - r

I am trying to scrape a textbox value from the URL in the code. I picked the css using slector gadget. It is not able to capture the content in the text box. Tested several other CSS toobut the textbox value is not captured.
Text box is : construction year
Please help . Below is the code for reference.
url = "https://www.ncspo.com/FIS/dbBldgAsset_public.aspx?BldgAssetID=8848"
values = list()
remDr$navigate(url)
page_source<-remDr$getPageSource()
a = read_html(page_source[[1]])
= html_nodes(a,"#ctl00_mainContentPlaceholder_txtConstructionYear_iu")
values = html_text(html_main_node)
values
Thanks in advance

Why RSelenium? It scrapes fine with rvest (though it is a horrible SharePoint site which may cause problems down the end with maintaining the proper view state cookies).
library(rvest)
pg <- html_session("https://www.ncspo.com/FIS/dbBldgAsset_public.aspx?BldgAssetID=8848")
html_attr(html_nodes(pg, "input#ctl00_mainContentPlaceholder_txtConstructionYear_iu"), "value")
## [1] 1987
You should be grabbing the value attribute vs the node text. This should work in the your selenium code, too.

The above answer also works. But if you are only trying to use RSelenium. Here is the code
library(RSelenium)
checkForServer()
startServer()
Sys.sleep(5)
re<-remoteDriver()
re$open()
re$navigate("https://www.ncspo.com/FIS/dbBldgAsset_public.aspx?BldgAssetID=8848")
re$findElement(using = "css selector", "#ctl00_mainContentPlaceholder_txtConstructionYear_iu")$clickElement()
text<-unlist(re$findElement(using = "css selector", "#ctl00_mainContentPlaceholder_txtConstructionYear_iu")$getElementAttribute("value"))
This works

Related

Unable while using Rvest and Selenium to extract text specific text located in <span> object from LinkedIn

First of all, I'm only a beginner in R so my apologies if this sound like a dumb question.
Basically, I want to scape the experience section in LinkedIn and extract the name of the position. As an example, I picked the profile of Hadley Wickham. As you can see on this Screenshot, the data I need ("Chief Scientist") is located in a Span object, with the span object itself located within several Div objects.
As a first attempt, I figured that I'll just try to extract directly the text from the Span objects using this code. However and unsurprisingly, it returned every text that was in other Span objects.
role <-signals %>%
html_nodes("span") %>%
html_nodes(".visually-hidden") %>%
html_text()
I can isolate the text I need by subsetting "[ ]" the object but I'm gonna apply this code to several LinkedIn profiles and the order of the title will change depending on the page. So I thought "Ok maybe I need to specify to R that I want to target the Span object that is located in the experience section and not the whole page" so I thought that I'll just need to mention in the code the "#experience" so that it only pick the Span object I need. But it only returned an empty object.
role <-signals %>%
html_nodes("#experience") %>%
html_nodes("span") %>%
html_nodes(".visually-hidden") %>%
html_text()
I'm pretty sure I'm missing some steps here but I can't figure out what. Maybe I need to specify each objects that are between "#experience" and "span" in order for this code to work but I feel there must be a better and easier way. Hope this make sense. I spent a lot of time trying to debug this and I'm not skilled enough in scraping to find a solution on my own.
As is, this requires RSelenium since data is rendered after the page loads and not with reading pre-defined html page. Refer here on how to launch a browser (either Chrome, Firefox, IE etc..) with the object as remdr
The below snippet opens Firefox but there are other ways to launch any other browser
selCommand <- wdman::selenium(jvmargs = c("-Dwebdriver.chrome.verboseLogging=true"), retcommand = TRUE)
shell(selCommand, wait = FALSE, minimized = TRUE)
remdr <- remoteDriver(port = 4567L, browserName = "firefox")
Sys.sleep(format(runif(1,10,15),digits = 1))
remdr$open()
You might have to login to LinkedIn since it won't allow viewing profiles without signing up. You will need to use the RSelenium clickElement and sendKeys functions to operate the webpage.
remdr$navigate(url = 'https://www.linkedin.com/') # navigate to the link
username <- remdr$findElement(using = 'id', value = 'session_key') # finding the username field
username$sendKeysToElement(list('your_USERNAME'))
password <- remdr$findElement(using = 'id', value = 'session_password') # finding the password field
password$sendKeysToElement(list('your_PASSWORD'))
remdr$findElement(using = 'xpath', value = "//button[#class='sign-in-form__submit-button']")$clickElement() # find and click Signin button
Once the page is loaded, you can get the page source and use rvest functions to read between the HTML tags. You can use this extension to easily get xpath selectors for the text you want to scrape.
pgSrc <- remdr$getPageSource()
pgData <- read_html(pgSrc[[1]])
experience <- pgData %>%
html_nodes(xpath = "//div[#class='text-body-medium break-words']") %>%
html_text(trim = TRUE)
Output of experience:
> experience
[1] "Chief Scientist at RStudio, Inc."

Web Scraping in R when you have inputs

I'm trying to write a code for web scraping in R when you have to introduce inputs.
Exactly, I have a platform where I need to complete 2 fields and after that click submit and get the results.
But I don't know how to use my columns in R like inputs in platform.
I searched for an example but I did't find any.
Pls, if anyone can give me a simple e.g.
Thank you
EDIT:
I don't have a code yet. I was looking for an example where you can use input for complete a field on a site and after that to scrape the result.
In the photo are the fields on my URL. So, in R I have a dataframe with 2 columns. One for CNP/CUI and one for VIN/SASIU with 100 rows or more. And I want to use this columns like input and take the output for every row.
EDIT2:
The example provided by #Dominik S.Meier it worked for me when I had a list for inputs. For column inputs I will post another question.
But, till then I want to mention few thing that helped me, maybe it will hep somebody else.
You need to be sure that all the versions matches: R version, browser version, browser driver version, Java version. For me it didn't match chromedriver version, even if I downloaded the right version. The problem was that I had 3 chromeversion and I think it didn't choose the right. I fixed with: rD <- rsDriver(browser = c("chrome"),port = 4444L,chromever = "83.0.4103.39"). More info here:enter link description here
Because one element didn't have id like in e.g. webElem <- remDr$findElement(using = "id", "trimite"), I used css selector. You can find the css selector with right click -> copy -> copy selector (in the html code on the page).
If you don't get the results, maybe you don't use the right selector. I did that and the result was list(). Then I tried more css selector from the "above" in the html code. I don'y know if it is the right solution, but for me it worked.
Hope it will help. Thank you.
Using RSelenium (see here for more infos):
library(RSelenium)
rD <- rsDriver(browser = c("firefox")) #specify browser type you want Selenium to open
remDr <- rD$client
remDr$navigate("https://pro.rarom.ro/istoric_vehicul/dosar_vehicul.aspx") # navigates to webpage
# select first input field
option <- remDr$findElement(using='id', value="inputEmail")
option$highlightElement()
option$clickElement()
option$sendKeysToElement(list("email#email.com"))
# select second input field
option <- remDr$findElement(using='id', value="inputEmail2")
option$highlightElement()
option$clickElement()
option$sendKeysToElement(list("email#email.com"))
# select second input field
option <- remDr$findElement(using='id', value="inputVIN")
option$highlightElement()
option$clickElement()
option$sendKeysToElement(list("123"))
#press key
webElem <- remDr$findElement(using = "id", "trimite")
webElem$highlightElement()
webElem$clickElement()

Nodes from a website are not scraping the content

I have tried to scrape the content of a news website ('titles', 'content', etc) but the nodes I am using do not return the content.
I have tried different nodes/tags, but none of them seem to be working. I have also used the SelectorGadget without any result. I have used the same strategy for scraping other websites and it has worked with no issues.
Here is an example of trying to get the 'content'
library(rvest)
url_test <- read_html('https://lasillavacia.com/silla-llena/red-de-la-paz/historia/las-disidencias-son-fruto-de-anos-de-division-interna-de-las-farc')
content_test <- html_text(html_nodes(url_test, ".article-body-mt-5"))
I have also tried using the xpath instead of the css class with no results.
Here is an example of trying to get the 'date'
content_test <- html_text(html_nodes(url_test, ".article-date"))
Even if I try to scrape all the <h>from the website page, for example, I do also get character(0)
What can be the problem? Thanks for any help!
Since the content is loaded by javascript to the page, I used RSelenium to scrape the data and it worked
library(RSelenium)
#Setting the remote browser
remDr <- RSelenium::remoteDriver(remoteServerAddr = "192.168.99.100",
port = 4444L,
browserName = "chrome")
remDr$open()
url_test <- 'https://lasillavacia.com/silla-llena/red-de-la-paz/historia/las-disidencias-son-fruto-de-anos-de-division-interna-de-las-farc'
remDr$navigate(url_test)
#Checking if the website page is loaded
remDr$screenshot(display = TRUE)
#Getting the content
content_test <- remDr$findElements(using = "css selector", value = '.article-date')
content_test <- sapply(content_test, function(x){x$getElementText()})
> content_test
[[1]]
[1] "22 de Septiembre de 2018"
Two things.
Your css selector is wrong. It should have been:
".article-body.mt-5"
The data is dynamically loaded and returned as json. You can find the endpoint in the network tab. No need for overhead of using selenium.
library(jsonlite)
data <- jsonlite::read_json('https://lasillavacia.com/silla_llena_api/get?path=/contenido-nodo/68077&_format=hal_json')
body is html so you could use html parser. The following is a simple text dump. You would refine with node selection.
library(rvest)
read_html(data[[1]]$body) %>% html_text()

WebScraping dynamic pages in R

I will change the website, to make this question better. Still facing similar issues, that can't use only rvest package and maybe answer will be easier to obtain with RSelenium. Website: http://ravimaailma.fi/cg/tulokset/20/ and I want to obtain links from the main article which would direct me to individual race results. Links look something like this: http://ravimaailma.fi/article/tulokset/pori-18-11-2017-tulokset/8718/
I'm trying to use simple Rvest as thought that would be all needed here. SelectorGadget is giving links CSS as .article-title a, so my code is simply
url %>%
read_html() %>%
html_nodes(".article-title a") %>%
html_text()
This will return nothing. Website loads more results when you scroll down, but I thought I would atleast get first results out. Below gives out some links and links 28:32 looks promising, but I think they are links from the sidebar, not from article.
url %>%
read_html() %>%
html_nodes("a") %>%
html_attr("href")
What I'm I doing wrong here and can RSelenium help me?
Here is my partial answer, still not getting all, but maybe helps some one. Code will return 1 link for first result. Not sure why it isn't giving them all. I'm using
library(RSelenium)
rD <- rsDriver(port = 4444L, browser = "chrome")
remDr <- rD[["client"]]
remDr$navigate("http://ravimaailma.fi/cg/tulokset/20/")
elem <- remDr$findElement(using="css selector", value=".article-title a")
elemtxt <- elem$getElementAttribute("href")
#Click button to load more results
#button <- remDr$findElement(using="id", value="loadmore")
#button$clickElement()
remDr$close()
I haven't used button click yet, but seemed that it was working as well. Only problem is that I can't get all results from the site.
[I'm not (yet) allowed to write comments, so I chose to make this post an answer]
RSelenium is not always necessary, you can also interact with a website using directly PhantomJS (see e.g. this example).
If you provided an example from the website instead of a local link to a .pdf, I can try to find out how to retrieve the data.

How do I click a javascript "link"? Is it my xpath or my relenium/selenium usage?

Like the beginning to any problem before I post it on stack overflow I think I have tried everything. This is a learning experience for me on how to work with javascript and xml so I'm guessing my problem is there.
My question is how to get the results of clicking on the parcel number links that are javascript links? I've tried getting the xpath of the link and using the $click method which following my intuition but this wasn't right or is at least not working for me.
Firefox 26.0
R 3.0.2
require(relenium)
library(XML)
library(stringr)
initializing_parcel_number <- "00000000000"
firefox <- firefoxClass$new()
firefox$get("http://www.muni.org/pw/public.html")
inputElement <- firefox$findElementByXPath("/html/body/form[2]/table/tbody/tr[2]/td/table[1]/tbody/tr[3]/td[4]/input[1]")
inputElement$sendKeys(initializing_parcel_number)
inputElement$sendKeys(key = "ENTER")
##xpath to the first link. Or is it?
first_link <- "/html/body/table/tbody/tr[2]/td/table[5]/tbody/tr[2]/td[1]/a"
##How I'm trying to click the thing.
linkElement <- firefox$findElementByXPath("/html/body/table/tbody/tr[2]/td/table[5]/tbody/tr[2]/td[1]/a")
linkElement$click()
You can do this using RSelenium. See http://johndharrison.github.io/RSelenium/ . DISCLAIMER I am the author of the RSelenium package. A basic vignette on operation can be viewed at RSelenium basics and
RSelenium: Testing Shiny apps
If you are unsure of what element is selected you can use the highlightElement utility method in the webElement class see the commented out code.
The element click event wont work in this case. You need to simulate a click using javascript:
require(RSelenium)
# RSelenium::startServer # if needed
initializing_parcel_number <- "00000000000"
remDr <- remoteDriver()
remDr$open()
remDr$navigate("http://www.muni.org/pw/public.html")
webElem <- remDr$findElement(using = "name", "PAR1")
# webElem$highlightElement() # to visually check what elemnet is selected
webElem$sendKeysToElement(list(initializing_parcel_number, key = "enter"))
# get first link containing javascript:getParcel
webElem <- remDr$findElement(using = "css selector", '[href*="javascript:getParcel"]')
# webElem$highlightElement() # to visually check what elemnet is selected
# send a webElement as an argument.
remDr$executeScript("arguments[0].click();", list(webElem))
#

Resources