I'm trying to write a crawler to download some information, similar to this Stack Overflow post. The answer there is useful for creating the filled-in form, but I'm struggling to find a way to submit the form when a submit button is not part of it. Here is an example:
session <- html_session("www.chase.com")
form <- html_form(session)[[3]]
filledform <- set_values(form, `user_name` = user_name, `usr_password` = usr_password)
session <- submit_form(session, filledform)
At this point, I receive this error:
Error in names(submits)[[1]] : subscript out of bounds
How can I make this form submit?
Here's a dirty hack that works for me: after studying the submit_form source code, I figured I could work around the problem by injecting a fake submit button into my in-memory copy of the form, which submit_form then uses. It works, except that it emits a warning that often names an irrelevant input object (not in the example below, though). Despite the warning, the code works for me:
session <- html_session("www.chase.com")
form <- html_form(session)[[3]]
# Form on home page has no submit button,
# so inject a fake submit button or else rvest cannot submit it.
# When I do this, rvest gives a warning "Submitting with '___'", where "___" is
# often an irrelevant field item.
# This warning might be an rvest (version 0.3.2) bug, but the code works.
fake_submit_button <- list(name = NULL,
                           type = "submit",
                           value = NULL,
                           checked = NULL,
                           disabled = NULL,
                           readonly = NULL,
                           required = FALSE)
attr(fake_submit_button, "class") <- "input"
form[["fields"]][["submit"]] <- fake_submit_button
user_name <- "user"
usr_password <- "password"
filledform <- set_values(form, `user_name` = user_name, `usr_password` = usr_password)
session <- submit_form(session, filledform)
The successful result displays the following warning, which I simply ignore:
> Submitting with 'submit'
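Incidentally, submit_form() appears to emit that line via message() (at least in rvest 0.3.x), so if the warning bothers you it can likely be silenced with suppressMessages():
# the "Submitting with ..." line comes from message(), not warning()
session <- suppressMessages(submit_form(session, filledform))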
Edit: updated with real, working login credentials, and with a few approaches we've tried, without success.
library(dplyr)
library(rvest)
library(xml2)
login_to_website <- function() {
  # Log in: create a web session at the desired login address
  login_url <- "https://www.sportsbettingdime.com/plus/register/"
  pgsession <- session(login_url); Sys.sleep(1)
  # grab & submit 1st form on page
  pgform <- html_form(pgsession)[[1]]
  filled_form <- html_form_set(pgform, email = "nickcanova#gmail.com", password = "tpe.vxp3ZRT!twq.wkd") # not working
  session_submit(pgsession, filled_form); Sys.sleep(1)
  # return session to use in other functions
  return(pgsession)
}
We are looking to log into this website; however, the fields in the returned pgform (pgform$fields) are unnamed. There is no name attribute on the fields when you inspect the website either...
> pgform
<form> '<unnamed>' (GET )
<field> (text) :
<field> (password) :
<field> (button) :
>
As a result, the html_form_set call doesn't work, throwing the error Error: Can't set value of fields that don't exist: 'email', 'password'.
EDIT: It looks like this works for setting the fields (replace html_form_set with the 4 lines below), but the result still does not leave us logged in...
pgform$fields[[1]]$value <- 'nickcanova#gmail.com'  # add email
pgform$fields[[2]]$value <- 'tpe.vxp3ZRT!twq.wkd'   # add password
pgform$method <- "POST"      # change from GET to POST
pgform$action <- login_url   # was NULL; can't be NULL, so set to login_url? (not sure here)
We set the method to POST and the action to the URL only because, in another function that logs into a different website, the method is POST and the action is the page's URL. We're not sure either of these is correct, though...
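Rather than guessing, one quick check (a sketch; reading the raw page directly) is to pull the method and action attributes straight off the form node:
pg <- xml2::read_html(login_url)
form_node <- rvest::html_element(pg, "form")
rvest::html_attr(form_node, "method")  # NA would mean the attribute is absent
rvest::html_attr(form_node, "action")  # absent attributes suggest a JavaScript-driven submit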
How to check whether login is successful: in the code below, the money and sharp columns are omitted when not logged in and present when logged in. Currently we are getting all NA values in these last two columns; the goal is to get actual numbers in these two fields:
page_url <- paste0('https://www.sportsbettingdime.com/', league, '/odds/')
page_overall <- pgsession %>% session_jump_to(page_url)
table_list <- page_overall %>% html_nodes('table.odds-table') %>% html_table()
df1 <- table_list[[1]] %>% as.data.frame()
colnames(df1) = c('teams', 'spread1', 'spread2', 'spread3', 'spread4', 'spread5', 'bet', 'money', 'sharp', 'bet2')
df1[, c('money', 'sharp')]
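Given that, a small helper (a sketch using the column names defined above) can report programmatically whether the session is authenticated:
# TRUE only when the logged-in-only columns actually contain data
is_logged_in <- function(df) any(!is.na(df$money)) && any(!is.na(df$sharp))
is_logged_in(df1)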
I have worked out how to log into a website from the following:
https://riptutorial.com/r/example/23955/using-rvest-when-login-is-required
Login code:
login <-"https://www.mysite.com.au/"
pgsession <- html_session(login)
pgform <- html_form(pgsession)[[2]]
filled_form <- set_values(pgform,
                          'ctl00$Content$Login' = "bangbang",
                          'ctl00$Content$Password' = "xxxxxx")
submit_form(pgsession, filled_form)
I'm now looking to solve the next problem on how exactly to logout of the website from R.
The website that I log into has a logout button, which is what I use when logging out via the browser.
I'm still very new to using rvest and don't have a proper understanding of it all, so I would appreciate detailed help on:
1. how to work out the logout string (I looked in the console and XHR in Firefox but am not sure which one to use); and
2. what command I should use to log out.
Thanks in advance.
Kind regards
I have a process that after login, finds and stores the URL used to logout.
myfunc <- function(user, pass) {
  sess <- rvest::html_session(myurl)
  ############################################################
  ### all of this is particular to the URL that *I* am scraping ... adapt for your own
  loginform <- sess %>%
    rvest::html_nodes("form") %>%
    rvest::html_form()
  formtypes <- lapply(loginform[[1]]$fields, `[[`, "type")
  formuser <- names(Filter(function(a) a == "text", formtypes))
  formpass <- names(Filter(function(a) a == "password", formtypes))
  formsubmit <- names(Filter(function(a) a == "submit", formtypes))
  formfields <- setNames(list(user, pass), c(formuser, formpass))
  formfields <- do.call(rvest::set_values, c(list(loginform[[1]]), formfields))
  loggedin <- rvest::submit_form(sess, formfields, formsubmit)
  # ... okay, now I'm logged in
  ############################################################
  ### this next section is pertinent to your need to find and eventually
  ### use the logout link in your html session
  # prepare for eventual logout
  logoutnodes <- rvest::html_nodes(loggedin, "a")
  logoutlinks <- rvest::html_attr(logoutnodes, "href")
  logouttexts <- rvest::html_text(logoutnodes)
  logoutind <- grep("log.*out", logouttexts, ignore.case = TRUE)
  logouturl <- logoutlinks[ logoutind[[1]] ]
  # ignore errors on logout attempt
  on.exit({
    # this code executes when 'myfunc' exits ... for whatever reason
    tryCatch(rvest::jump_to(loggedin, logouturl),
             error = function(e) NULL)
  }, add = TRUE)
  ############################################################
  # now we can do the real purpose of myfunc
  # ...
  # ...
}
Granted, this is rather verbose and drawn-out, but on my website the variable names are unpredictable and cluttered, so when I found that this method worked, I stuck with it.
In your case, the premise is that you:
1. find the "log out" link and remember it; and
2. use on.exit to call rvest::jump_to(sess, logouturl), as in the condensed sketch below.
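A condensed sketch of those two steps (assuming, as above, that the logout link's text contains some form of "log out"):
links <- rvest::html_nodes(sess, "a")
logouturl <- rvest::html_attr(links, "href")[
  grep("log.*out", rvest::html_text(links), ignore.case = TRUE)[1]]
# log out when the calling function exits, ignoring any errors
on.exit(tryCatch(rvest::jump_to(sess, logouturl), error = function(e) NULL),
        add = TRUE)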
I have signed in to a website using R 3.5.2, and this seems to have gone well both with rvest_0.3.4 and httr_1.4.0, but then I get stuck on a redirecting page which, in the browser (Chrome), is shown only for a few seconds after I hit the "Login!" button.
The problematic step seems to be a form with method="post" and hidden inputs (input type="hidden"), which I don't manage to submit from R.
URL of the CDP sign-in page:
signin <- "https://www.cdp.net/en/users/sign_in"
rvest
library(rvest)
user.email <- "my_email"
user.password <- "my_password"
signin.session <- html_session(signin)
signin.form <- html_form(signin.session)[[1]]
filled.signin <- set_values(signin.form,
                            `user[email]` = user.email,
                            `user[password]` = user.password)
signed.in <- submit_form(signin.session, filled.signin)
read_html(signed.in) %>% html_node("form")
httr
library(httr)
login <- list(
  `user[email]` = "my_email",
  `user[password]` = "my_password",
  submit = "Login!")
signed.in.post <- POST(signin, body = login, encode = "form", verbose())
http_status(signed.in.post)
content(signed.in.post, as = "parsed")
read_html(signed.in.post$url) %>% html_node("form")
My goal is to access my account and browse the website, but I don't know how to go through the redirecting page from R.
Updating the previous answer by IvanP with the up-to-date rvest function names:
library(rvest)
signin.session <- session(signin)
signin.form <- html_form(signin.session)[[1]]
filled.signin <- html_form_set(signin.form,
                               `user[email]` = user.email,
                               `user[password]` = user.password)
signed.in <- session_submit(signin.session, filled.signin)
redirect.form <- html_form(signed.in)[[1]]
redirected <- session_submit(signed.in, redirect.form)
SOLVED!
It was a quite easy and intuitive solution: I just needed to submit the form with method="post" and hidden inputs (input type="hidden") on the redirecting page, i.e. the one encountered in the signed.in session.
I solved it with rvest but I think that httr would be equally easy, here comes the code I used:
library(rvest)
signin.session <- html_session(signin)
signin.form <- html_form(signin.session)[[1]]
filled.signin <- set_values(signin.form,
                            `user[email]` = user.email,
                            `user[password]` = user.password)
signed.in <- submit_form(signin.session, filled.signin)
redirect.form <- html_form(signed.in)[[1]]
redirected <- submit_form(signed.in, redirect.form)
This last object, redirected, is a session-class object: essentially the page that can be browsed normally after signing in to the website.
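For instance, you can keep navigating from it like any other session (the URL below is purely hypothetical):
account <- jump_to(redirected, "https://www.cdp.net/en/some/member-only/page")  # hypothetical URL
account %>% html_node("title") %>% html_text()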
In case someone has a shorter, more effective, more elegant/sexy/charming solution to proceed...please don't hesitate to share it.
I'm an absolute beginner of web-scraping, and I am keen to learn more about these operations!
THX
I wanted to scrape some data from following website:
http://predstecajnenagodbe.fina.hr/pn-public-web/predmet/search
but when I tried to use rvest:
library(rvest)
session <- html_session("http://predstecajnenagodbe.fina.hr/pn-public-web/predmet/search")
form <- html_form(session)
form
it doesn't find the form, even though it is there (as you can see on the page).
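One quick diagnostic (a sketch) is to count the form nodes in the raw HTML that R actually receives; zero would mean the form is injected client-side by JavaScript, which rvest cannot execute:
raw <- xml2::read_html("http://predstecajnenagodbe.fina.hr/pn-public-web/predmet/search")
length(xml2::xml_find_all(raw, "//form"))  # 0 suggests the form is built by JavaScript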
I have also tried the POST function from the httr package:
parameters <- list(since = "1.6.2018", until = "5.6.2018", `g-recaptcha-response` = "03AF6jDqXcBw1qmbrxWqadGqh9k8eHAzB9iPbYdnwzhEVSgCwO0Mi6DQDgckigpeMH1ikV70egOC0UppZsO7tO9hgdpEIaI04jTpG6JxGMR6wov27kEkLuVsEp1LhxZB4WFDRkDWdqcZeVN1YkiojUpje4k-swFG7tPyG2pJN86SdT290D9_0fyfrxlpfFNL2VUwE_c15vVthcBEdXIQ68V5qv7ZVooLiwrdTO2qLDLF1yUZWiu9IJoLuBWdFzJ_zdSP6fbuj5wTpfPdsYJ2n988Gcb3q2aYdn-2TVuWoQzqs1wbh7ya_Geo7_8gnDUL92l2nqTeV9CMY58fzppPPYDJcchdHFTTxadGwCGZyKC3WUSh81qiGZ5JhNDUpPnOO-MgSr5aPbA7tei7bbypHV9OOVjPGLLtqA9g")
# note: get_header() and get_encoding() below are my own helper functions
httr::POST(
  url,
  body = parameters,
  config = list(
    add_headers(Referer = "http://predstecajnenagodbe.fina.hr"),
    user_agent(get_header()),
    accept_encoding = get_encoding(),
    use_proxy("xxxx", port = 80,
              username = "xxx", password = "xxxx"),
    timeout(20L),
    tcp_keepalive = FALSE
  ),
  encode = "form",
  verbose()
)
but it returns some JS code and message:
Please enable JavaScript to view the page content.Your support ID is:
10544975822212666004
Could you please explain why rvest doesn't recognize the form, and why POST doesn't work either?
First I'd like to take a moment and thank the SO community,
You helped me many times in the past without me needing to even create an account.
My current problem involves web scraping with R. Not my strong point.
I would like to scrape http://www.cbs.dtu.dk/services/SignalP/
What I have tried:
library(rvest)
url <- "http://www.cbs.dtu.dk/services/SignalP/"
seq <- "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM"
session <- rvest::html_session(url)
form <- rvest::html_form(session)[[2]]
form <- rvest::set_values(form, `SEQPASTE` = seq)
form_res_cbs <- rvest::submit_form(session, form)
#rvest prints out:
Submitting with 'trunc'
rvest::html_text(rvest::html_nodes(form_res_cbs, "head"))
#output:
"Configuration error"
rvest::html_text(rvest::html_nodes(form_res_cbs, "body"))
#output:
"Exception:WebfaceConfigErrorPackage:Webface::service : 358Message:Unhandled parameter 'NULL' in form "
I am unsure what the unhandled parameter is.
Is the problem in the submit button? I cannot seem to force:
form_res_cbs <- rvest::submit_form(session, form, submit = "submit")
#rvest prints out
Error: Unknown submission name 'submit'.
Possible values: trunc
Is the problem that submit$name is NULL?
form[["fields"]][[23]]
I tried defining the fake submit button as suggested here:
Submit form with no submit button in rvest
with no luck.
I am open to solutions using rvest or RCurl/httr; I would like to avoid using RSelenium.
EDIT: thanks to hrbrmstr's awesome answer I was able to build a function for this task. It is available in the ragp package: https://github.com/missuse/ragp
Well, this is doable. But it's going to require elbow grease.
This part:
library(rvest)
library(httr)
library(tidyverse)
POST(
  url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
  encode = "form",
  body = list(
    `configfile` = "/usr/opt/www/pub/CBS/services/SignalP-4.1/SignalP.cf",
    `SEQPASTE` = "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM",
    `orgtype` = "euk",
    `Dcut-type` = "default",
    `Dcut-noTM` = "0.45",
    `Dcut-TM` = "0.50",
    `graphmode` = "png",
    `format` = "summary",
    `minlen` = "",
    `method` = "best",
    `trunc` = ""
  ),
  verbose()
) -> res
Makes the request you made. I left verbose() in so you can watch what happens. It's missing the "filename" field, but you specified the string, so it's a good mimic of what you did.
Now, the tricky part is that the service uses an intermediary redirect page that gives you a chance to enter an e-mail address to be notified when the query is done. It also checks regularly (every ~10s or so) whether the query has finished, and redirects quickly if so.
That page has the query id which can be extracted via:
content(res, as = "parsed") %>%
  html_nodes("input[name='jobid']") %>%
  html_attr("value") -> jobid
Now, we can mimic the final request, but I'd add in a Sys.sleep(20) before doing so to ensure the report is done.
Sys.sleep(20)  # as noted above, give the job time to finish
GET(
  url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
  query = list(
    jobid = jobid,
    wait = "20"
  ),
  verbose()
) -> res2
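If you'd rather not rely on a fixed sleep, you could instead poll until the wait page stops refreshing. A hedged sketch: the meta-refresh marker is an assumption about how the queue page signals that the job is still running:
repeat {
  res2 <- GET(
    url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
    query = list(jobid = jobid, wait = "20")
  )
  pg <- content(res2, as = "parsed")
  # stop polling once the page no longer asks the browser to refresh
  if (length(html_nodes(pg, "meta[http-equiv='refresh']")) == 0) break
  Sys.sleep(10)
}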
Either way, that grabs the final results page:
# html_print() and HTML() come from the htmltools package
htmltools::html_print(htmltools::HTML(content(res2, as = "text")))
You can see that images are missing because GET only retrieves the HTML content. You can use functions from rvest/xml2 to parse the page and scrape out the tables and the URLs, which you can then use to fetch new content.
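For example, something along these lines (the selectors are assumptions, not taken from the actual results page):
pg <- content(res2, as = "parsed")
tables <- html_table(html_nodes(pg, "table"), fill = TRUE)  # any result tables
img_urls <- html_attr(html_nodes(pg, "img"), "src")         # plot URLs to fetch separately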
To do all this, I used Burp Suite to intercept a browser session and then my burrp R package to inspect the results. You can also inspect things visually in Burp Suite and build the requests more manually.