I am new to web scraping and want to scrape data from https://www.forwardpathway.com/us-college-database. I used the following code to extract the data from the table, but the page just kept on loading after I clicked the next button. Can anybody point out what is wrong?
library(RSelenium)
library(tidyverse)
library(netstat)
library(xml2)
library(data.table)
library(rvest)

binman::list_versions("chromedriver")

rs_driver_object <- rsDriver(browser = "chrome",
                             chromever = "107.0.5304.62",
                             verbose = FALSE,
                             port = free_port())

## create the client
remDr <- rs_driver_object$client

## open the browser
remDr$open()
remDr$navigate("https://www.forwardpathway.com/us-college-database")

## locate the table that stores the data
data_table <- remDr$findElement(using = "id", "table_1")
I tried three different methods to click the next button, but the problem persisted.
## next button method 1
next_button <- remDr$findElement(using = "id", "table_1_next")
next_button$clickElement()

## next button method 2
remDr$executeScript("document.getElementById('table_1_next').click()")

## next button method 3
next_button <- remDr$findElement("id", "table_1_next")
next_button$sendKeysToElement(list(key = "enter"))
all_data <- list()
cond <- TRUE

while (cond == TRUE) {
  data_table_html <- data_table$getPageSource()
  page <- read_html(data_table_html %>% unlist())
  df <- html_table(page) %>% .[[1]]
  all_data <- rbindlist(list(all_data, df))
  Sys.sleep(5)
  tryCatch(
    {
      next_button <- remDr$findElement("id", "table_1_next")
      next_button$sendKeysToElement(list(key = "enter"))
    },
    error = function(e) {
      print("script complete")
      cond <<- FALSE
    }
  )
  if (cond == FALSE) {
    break
  }
}
My question is about scraping with RSelenium.
I am trying to scrape data from the following website using RSelenium: https://www.nhtsa.gov/ratings.
My present difficulty lies in moving between pages for a given carmaker.
This is my code so far:
library(RSelenium)

# opens a connection
rD <- rsDriver()
remDr <- rD$client

# goes to the page we want
url <- "https://www.nhtsa.gov/ratings"
remDr$navigate(url)

# clicking to open the manufacturer selection "page"
webElem <- remDr$findElement(using = 'css selector', "#vehicle a")
webElem$clickElement()

# opening the options menu
option.menu <- remDr$findElement(using = 'css selector', 'select')
option.menu$clickElement()

# selecting one maker, loop over this later
maker.select <- remDr$findElement(using = 'xpath', "//*/option[@value = 'AUDI']")
maker.select$clickElement()

# search our selection
maker.click <- remDr$findElement(using = 'css selector', '.manufacturer-search-submit')
maker.click$clickElement()

# now we have to go through each car (10 per page), loop later
cars <- remDr$findElement(using = 'css selector', 'tbody:nth-child(6) a')
individual.link <- cars$getElementAttribute("href")

# going to the next page
next_page <- remDr$findElement(using = 'css selector', 'button.btn.link-arrow::after')
next_page$clickElement()
But I get the error:
Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
Further Details: run errorDetails method
As you can probably see, I am new to RSelenium. Any help that you can give me would be appreciated. Thanks in advance.
Here is another approach that might be of help.
You can access the data simply by sending a GET request to the website. From the website (on the first page), we can see
'https://api.nhtsa.gov/vehicles/byManufacturer?offset=0&max=10&sort=overallRating&order=desc&data=crashtestratings,recommendedfeatures&productDetail=all&dateStart=2011-01-01&manufacturerName=AUDI&dateEnd=3000-01-01&name='
This is where we can get the data. The second page will have offset=10, then 20, 30, and so on.
If api_url is defined to be the above URL, then we can get the data using httr:
# request the data
request <- httr::GET(api_url)
# retrieve the content
request_content <- httr::content(request)
request_result <- request_content$results
# request results contains the data of interest
# A few glimpses into the data
# The first model
request_result[[1]]$vehicleModel
# [1] "A3"
request_result[[1]]$modelYear
# [1] 2018
request_result[[1]]$manufacturer
# [1] "AUDI OF AMERICA, INC"
Now, by adjusting offset, it is straightforward to build a loop and gather all pages:
out <- list()
k <- 0L
i <- 1L
while (k < 1e+3) {
  req_url <- paste0('https://api.nhtsa.gov/vehicles/byManufacturer?offset=',
                    k,
                    '&max=10&sort=overallRating&order=desc&data=crashtestratings,recommendedfeatures&productDetail=all&dateStart=2011-01-01&manufacturerName=AUDI&dateEnd=3000-01-01&name=')
  req <- httr::content(httr::GET(req_url))$results
  if (length(req) == 0) break
  out[[i]] <- req
  cat(paste0('\nAdded content for offset \t', k))
  i <- i + 1L
  k <- k + 10L
}
lengths(out)
# [1] 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
Note that you can also play around with manufacturerName in the URL, and with many more arguments, to get clean and tailored data.
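For example, here is a minimal sketch (untested; it assumes the same endpoint and the field names vehicleModel, modelYear, and manufacturer shown above, and that other makers accept the same query) that wraps the paging loop in a reusable helper and flattens a few fields into a data frame:

library(httr)

get_manufacturer_ratings <- function(maker, page_size = 10L, max_offset = 1000L) {
  val <- function(x) if (is.null(x)) NA else x   # guard against missing fields
  rows <- list()
  offset <- 0L
  while (offset < max_offset) {
    req_url <- paste0(
      "https://api.nhtsa.gov/vehicles/byManufacturer?offset=", offset,
      "&max=", page_size,
      "&sort=overallRating&order=desc",
      "&data=crashtestratings,recommendedfeatures&productDetail=all",
      "&dateStart=2011-01-01&dateEnd=3000-01-01&name=",
      "&manufacturerName=", utils::URLencode(maker, reserved = TRUE)
    )
    res <- httr::content(httr::GET(req_url))$results
    if (length(res) == 0) break
    rows <- c(rows, lapply(res, function(x) {
      data.frame(manufacturer = val(x$manufacturer),
                 model        = val(x$vehicleModel),
                 year         = val(x$modelYear),
                 stringsAsFactors = FALSE)
    }))
    offset <- offset + page_size
  }
  do.call(rbind, rows)
}

# e.g. audi <- get_manufacturer_ratings("AUDI"); head(audi)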
I want to extract a table periodically from the site below.
The price list changes when a building block name (BLOK 16 A, BLOK 16 B, BLOK 16 C, ...) is clicked. The URL doesn't change; the page changes by triggering
javascript:__doPostBack('ctl00$ContentPlaceHolder1$DataList2$ctl04$lnk_blok','')
I've tried three approaches after searching Google and Stack Overflow.
What I tried, no. 1: this doesn't trigger the doPostBack event.
postForm( "http://www.kentkonut.com.tr/tr/modul/projeler/daire_fiyatlari.aspx?id=44", ctl00_ContentPlaceHolder1_DataList2_ctl03_lnk_blok="ctl00$ContentPlaceHolder1$DataList2$ctl03$lnk_blok")
What I tried, no. 2: the Selenium server seems to be running on http://localhost:4444/, but remoteDriver doesn't navigate; it returns this error: (Error in checkError(res) :
Undefined error in httr call. httr output: length(url) == 1 is not TRUE)
library(RSelenium)
startServer()
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4444L, browserName = "firefox")
remDr$open()
remDr$getStatus()
remDr$navigate("http://www.kentkonut.com.tr/tr/modul/projeler/daire_fiyatlari.aspx?id=44")
What I tried, no. 3: this is another way to trigger the doPostBack event. It doesn't navigate either.
library(RCurl)

base.url <- "http://www.kentkonut.com.tr/tr/modul/projeler/"
event.target <- 'ctl00$ContentPlaceHolder1$DataList2$ctl03$lnk_blok'
action <- "daire_fiyatlari.aspx?id=44"
ftarget <- paste0(base.url, action)
dum <- getURL(ftarget)
event.val <- unlist(strsplit(dum, "__EVENTVALIDATION\" value=\""))[2]
event.val <- unlist(strsplit(event.val, "\" />\r\n\r\n<script"))[1]
view.state <- unlist(strsplit(dum, "id=\"__VIEWSTATE\" value=\""))[2]
view.state <- unlist(strsplit(view.state, "\" />\r\n\r\n\r\n<script"))[1]
web.data <- postForm(ftarget,
                     "form name" = "ctl00_ContentPlaceHolder1_DataList2_ctl03_lnk_blok",
                     "method" = "POST",
                     "action" = action,
                     "id" = "ctl00_ContentPlaceHolder1_DataList2_ctl03_lnk_blok",
                     "__EVENTTARGET" = event.target,
                     "__EVENTVALIDATION" = event.val,
                     "__VIEWSTATE" = view.state)
Thanks for your help.
library(rvest)

url <- "http://www.kentkonut.com.tr/tr/modul/projeler/daire_fiyatlari.aspx?id=44"
pgsession <- html_session(url)
t <- html_table(html_nodes(read_html(pgsession), css = "#ctl00_ContentPlaceHolder1_DataList1"), fill = TRUE)[[1]]
even_indices <- seq(2, length(t$X1), 2)
t <- t[even_indices, ]
t <- t[2:length(t$X1), ]
EDITED CODE:
library(rvest)
url<-"http://www.kentkonut.com.tr/tr/modul/projeler/daire_fiyatlari.aspx?id=44"
pgsession<-html_session(url)
pgform<-html_form(pgsession)[[1]]
page<-rvest:::request_POST(pgsession,"http://www.kentkonut.com.tr/tr/modul/projeler/daire_fiyatlari.aspx?id=44",
body=list(
`__VIEWSTATE`=pgform$fields$`__VIEWSTATE`$value,
`__EVENTTARGET`="ctl00$ContentPlaceHolder1$DataList2$ctl01$lnk_blok",
`__EVENTARGUMENT`="",
`__VIEWSTATEGENERATOR`=pgform$fields$`__VIEWSTATEGENERATOR`$value,
`__VIEWSTATEENCRYPTED`=pgform$fields$`__VIEWSTATEENCRYPTED`$value,
`__EVENTVALIDATION`=pgform$fields$`__EVENTVALIDATION`$value
),
encode="form"
)
# in the example above, change __EVENTTARGET to "ctl00$ContentPlaceHolder1$DataList2$ctl02$lnk_blok" to get a different table
t <- html_table(html_nodes(read_html(page), css = "#ctl00_ContentPlaceHolder1_DataList1"), fill = TRUE)[[1]]
even_indices <- seq(2, length(t$X1), 2)
t <- t[even_indices, ]
t <- t[2:length(t$X1), ]
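Following the comment above about changing the event target, here is a rough sketch (untested; it assumes the block links follow the ctl01, ctl02, ... $lnk_blok pattern seen in the question, and it relies on the same internal rvest:::request_POST call as the edited code) that loops over several blocks and keeps one cleaned table per block:

library(rvest)

url <- "http://www.kentkonut.com.tr/tr/modul/projeler/daire_fiyatlari.aspx?id=44"
pgsession <- html_session(url)
pgform <- html_form(pgsession)[[1]]

# Assumed pattern for the block links; adjust the indices to match the page.
targets <- sprintf("ctl00$ContentPlaceHolder1$DataList2$ctl%02d$lnk_blok", 1:4)

tables <- lapply(targets, function(tgt) {
  page <- rvest:::request_POST(pgsession, url,
                               body = list(
                                 `__VIEWSTATE` = pgform$fields$`__VIEWSTATE`$value,
                                 `__EVENTTARGET` = tgt,
                                 `__EVENTARGUMENT` = "",
                                 `__VIEWSTATEGENERATOR` = pgform$fields$`__VIEWSTATEGENERATOR`$value,
                                 `__VIEWSTATEENCRYPTED` = pgform$fields$`__VIEWSTATEENCRYPTED`$value,
                                 `__EVENTVALIDATION` = pgform$fields$`__EVENTVALIDATION`$value
                               ),
                               encode = "form")
  t <- html_table(html_nodes(read_html(page),
                             css = "#ctl00_ContentPlaceHolder1_DataList1"),
                  fill = TRUE)[[1]]
  t <- t[seq(2, length(t$X1), 2), ]   # keep the value rows, as in the answer above
  t[2:length(t$X1), ]                 # drop the header row, as in the answer above
})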
I am new to RSelenium. I have been trying to scrape a page that has many pager pages (1,220). R clicks on each pager page and takes the information from each one with the following code:
library(XML)   # htmlParse, xmlAttrs

# remDr is assumed to be an open RSelenium session from earlier setup
links <- matrix(NA, nrow = 1220 * 32, ncol = 1)   # room for 32 links on each of the 1220 pages
xx <- 1
for (y in 1:1220) {
  remDr$findElement(using = "xpath", paste0("//*[@id='paginador_pagina_", y, "']/span"))$clickElement()
  page_source <- remDr$getPageSource()
  doc <- htmlParse(page_source[[1]])
  for (k in 1:32) {
    v <- paste0("//*[@id='rb_contResultados']/dl[", k, "]/div/a[2]")
    link <- try(doc[v])
    link <- try(toString(xmlAttrs(link[[1]])))
    link <- sub(".*_blank, ", "", link)
    links[xx, 1] <- link
    xx <- xx + 1
  }
  Sys.sleep(runif(1, 3, 10))
}
remDr$closeWindow()
But when R is on page 166 (for example), I get the following error:
Error: Summary: UnexpectedAlertOpen
Detail: A modal dialog was open, blocking this operation
class: org.openqa.selenium.UnhandledAlertException
Why am I getting this error? How can I solve it?
Adapting this SO answer, I'm trying to use rvest to submit a form and scrape the resulting page. I keep coming up with an error.
library(rvest)
url <- "https://iemweb.biz.uiowa.edu/pricehistory/pricehistory_SelectContract.cfm?market_ID=214"
pg.session <- html_session(url)
pg.form <- html_form(html(pg.session))
filled_form <- set_values(pg.form[[1]],
Month = "8",
Year = "1")
out <- submit_form(session = pg.session, pg.form)
returns this error
Submitting with ''
Error in if (!(submit %in% names(submits))) { :
argument is of length zero
What am I doing wrong?
Well, for one thing, you are not submitting the form you actually filled in, and you are passing in a list of forms rather than a single form. Beyond that, there appears to be a bug in the code: it doesn't recognize submit buttons whose type is written in upper case. In this case, the HTML contains
<INPUT TYPE="SUBMIT" VALUE="Get Prices">
and submit_form calls submit_request, which looks for submit buttons via
submits <- Filter(function(x) identical(x$type, "submit"),
form$fields)
and since it checks for values identical to "submit", it doesn't find "SUBMIT":
sapply(pg.form[[1]]$fields, function(x) x$type)
# $Market_ID
# [1] "HIDDEN"
# $Month
# NULL
# $Year
# NULL
# $`NULL`
# [1] "SUBMIT"
The easiest thing might be to change it ourselves:
filled_form <- set_values(pg.form[[1]],
Month = "08",
Year = "2007")
filled_form$fields[[4]]$type <- "submit"
The other problem is that this version has a bug in the way the URL for the form is resolved. We can fix it with:
# incorrectly was: url <- XML::getRelativeURL(session$url, form$url)
body(submit_form)[[3]]<-quote(url <- XML::getRelativeURL(form$url, session$url))
And now, finally, we can submit the request:
out <- submit_form(session = pg.session, filled_form)
# out %>% html_table()
(Tested with rvest_0.2.0.9000)
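Putting the pieces together, roughly (this sketch only applies to that same old rvest version, since it relies on the patched submit_form and on the field layout shown by the sapply call above, where the submit button is the fourth field):

library(rvest)

url <- "https://iemweb.biz.uiowa.edu/pricehistory/pricehistory_SelectContract.cfm?market_ID=214"
pg.session <- html_session(url)
pg.form <- html_form(html(pg.session))

filled_form <- set_values(pg.form[[1]], Month = "08", Year = "2007")
filled_form$fields[[4]]$type <- "submit"   # lower-case so submit_form can find it

# patch the URL-resolution bug in this rvest version
body(submit_form)[[3]] <- quote(url <- XML::getRelativeURL(form$url, session$url))

out <- submit_form(session = pg.session, filled_form)
out %>% html_table()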
I am attempting to compile a corpus of the user timelines of a specific subset of Twitter users. My problem is that in the existing code (given below), when a user's account has been suspended or deleted, the code breaks, giving the output and error shown below.
## ORIGINAL ##
for (user in users) {
  # Download user's timeline from Twitter
  tweets <- userTimeline(user)
  # Extract tweets
  tweets <- unlist(lapply(tweets, function(t) t$getText()))
  # Save tweets to file
  write.csv(tweets, file = paste(user, ".csv", sep = ""), row.names = F)
  # Sys.sleep(sleepTime)
}
[1] "Not Found"
Error in twInterfaceObj$doAPICall(cmd, params, method,
...) : Error: Not Found
My question is, how can I keep the script running, saving some sort of null result for the 'missing' (deleted/inactive) accounts?
I am using the twitteR package in R: ftp://cran.r-project.org/pub/R/web/packages/twitteR/twitteR.pdf
#EDIT#

# Pause for 60 sec between users
sleepTime <- 60

for (user in users) {
  # tell the loop to skip a user if their account is protected
  # or some other error occurs
  result <- try(userTimeline(user), silent = TRUE)
  if (class(result) == "try-error") next
  # Download user's timeline from Twitter
  tweets <- userTimeline(user)
  # Extract tweets
  tweets <- unlist(lapply(tweets, function(t) t$getText()))
  # Save tweets to file
  write.csv(tweets, file = paste(user, ".csv", sep = ""), row.names = F)
  # Pause for 60 s between iterations to avoid exceeding the Twitter API request limit
  print('Sleeping for 60 seconds...')
  Sys.sleep(sleepTime)
}

# Now inspect tweets to see the user's timeline data
You can catch the exception; see ?try or ?tryCatch. For example:
tweets <- try(userTimeline(user), silent = TRUE)
if (inherits(tweets, 'try-error')) {
  return(NULL)
} else {
  ## process normally here
}
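For instance, a sketch of how that check could slot into the loop above (assuming the same users vector and twitteR setup already shown; writing an empty CSV is just one way to record a null result for a suspended or deleted account):

sleepTime <- 60

for (user in users) {
  result <- try(userTimeline(user), silent = TRUE)
  if (inherits(result, "try-error")) {
    # null result for a deleted/suspended account: an empty file as a placeholder
    write.csv(character(0), file = paste0(user, ".csv"), row.names = FALSE)
  } else {
    tweets <- unlist(lapply(result, function(t) t$getText()))
    write.csv(tweets, file = paste0(user, ".csv"), row.names = FALSE)
  }
  Sys.sleep(sleepTime)  # stay under the Twitter API rate limit
}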