rvest package doesn't recognize the form

I wanted to scrape some data from the following website:
http://predstecajnenagodbe.fina.hr/pn-public-web/predmet/search
but when I tried to use rvest:
library(rvest)
session <- html_session("http://predstecajnenagodbe.fina.hr/pn-public-web/predmet/search")
form <- html_form(session)
form
it doesn't find the form, even though it is there (as you can see on the page).
I have also tried the POST() function from the httr package:
parameters <- list(since = "1.6.2018", until = "5.6.2018", `g-recaptcha-response` = "03AF6jDqXcBw1qmbrxWqadGqh9k8eHAzB9iPbYdnwzhEVSgCwO0Mi6DQDgckigpeMH1ikV70egOC0UppZsO7tO9hgdpEIaI04jTpG6JxGMR6wov27kEkLuVsEp1LhxZB4WFDRkDWdqcZeVN1YkiojUpje4k-swFG7tPyG2pJN86SdT290D9_0fyfrxlpfFNL2VUwE_c15vVthcBEdXIQ68V5qv7ZVooLiwrdTO2qLDLF1yUZWiu9IJoLuBWdFzJ_zdSP6fbuj5wTpfPdsYJ2n988Gcb3q2aYdn-2TVuWoQzqs1wbh7ya_Geo7_8gnDUL92l2nqTeV9CMY58fzppPPYDJcchdHFTTxadGwCGZyKC3WUSh81qiGZ5JhNDUpPnOO-MgSr5aPbA7tei7bbypHV9OOVjPGLLtqA9g")
httr::POST(
  url,
  body = parameters,
  config = list(
    add_headers(Referer = "http://predstecajnenagodbe.fina.hr"),
    user_agent(get_header()),
    accept_encoding = get_encoding(),
    use_proxy("xxxx", port = 80,
              username = "xxx", password = "xxxx"),
    timeout(20L),
    tcp_keepalive = FALSE
  ),
  encode = "form",
  verbose()
)
but it returns some JS code and the message:
Please enable JavaScript to view the page content.Your support ID is:
10544975822212666004
Could you please explain why rvest doesn't recognize the form, and why POST doesn't work either?
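A quick way to confirm the diagnosis (a sketch, assuming the URL still responds as it did): parse the raw HTML the server sends and look for `form` elements. If none are found, the form is built client-side by JavaScript, which rvest cannot execute; the "Please enable JavaScript" message from the POST points the same way.

```r
library(rvest)

# Fetch the static HTML exactly as rvest sees it: no JavaScript runs.
# (Assumption: this is the same search page as above.)
raw_page <- xml2::read_html(
  "http://predstecajnenagodbe.fina.hr/pn-public-web/predmet/search"
)

# An empty list here means no <form> exists in the markup the server sent:
html_form(raw_page)

# Equivalently, count the <form> nodes directly:
length(html_nodes(raw_page, "form"))
```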

Related

R programming: GET function for HTTP with query parameters

I'm using R and the httr package to make an HTTP request.
I want to call GET() with query parameters.
GET(url = NULL, config = list(), ..., handle = NULL)
The request URL contains the parameters, separated from the base URL by a question mark (?):
1. url: https://example.com
2. url parameter: 'title='
# Function to get links to a specific page
page_link <- function() {
  url <- "https://example.com?"
  q1 <- list(title = "")
  page_link <- GET(url, query = q1)
  return(page_link)
}
If you're asking how to combine the URL and the parameter into a single string to request with GET(), you should try paste0(), e.g.:
url <- paste0("https://example.com?", q1[x])
page_link <- GET(url)
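One advantage of the query argument over hand-pasting: httr percent-encodes the values for you. httr's modify_url() shows the URL that GET(url, query = ...) would build, without making any request:

```r
library(httr)

# Build the full request URL offline; the space and "&" in the value are
# percent-encoded automatically, which a plain paste0() would not do:
modify_url("https://example.com", query = list(title = "war & peace"))
```

If the value were pasted in raw, the literal `&` would split the parameter in two.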

add optional query parameters into R POST request

I want to make a POST request to https://rest.ensembl.org. Currently the following works:
server <- "https://rest.ensembl.org"
ext <- "/vep/human/hgvs"
r <- POST(paste(server, ext, sep = ""),
          content_type("application/json"),
          accept("application/json"),
          body = '{ "hgvs_notations" : ["chr2:g.10216G>T"] }')
which results in this URL: https://rest.ensembl.org/vep/human/hgvs/chr2:g.10216G>T. I would like to use the ? parameter to modify my URL to https://rest.ensembl.org/vep/human/hgvs/chr2:g.10216G>T?CADD=1, but I can't see how to do this with R's POST() function.
Any help would be great!
If it is always the same parameter you need to send, why not just include it in the URI?
You could do something like POST(paste0(server, ext, '?CADD=1'), [...]).
Or would that not be dynamic enough for your use case?
A less hacky way to include parameters is the query argument:
library(httr)
library(jsonlite)
r <- POST(
  paste0(server, ext),
  query = list(CADD = 1),
  content_type_json(),
  accept_json(),
  body = toJSON(list(hgvs_notations = c("chr2:g.10216G>T")))
)
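A note on the body: jsonlite keeps length-one vectors as JSON arrays by default, which is exactly what this endpoint expects for hgvs_notations. A small offline check, nothing Ensembl-specific:

```r
library(jsonlite)

# Default behaviour: a length-one vector stays a JSON array.
toJSON(list(hgvs_notations = c("chr2:g.10216G>T")))
# {"hgvs_notations":["chr2:g.10216G>T"]}

# auto_unbox = TRUE would collapse it to a bare string -- not wanted here.
toJSON(list(hgvs_notations = "chr2:g.10216G>T"), auto_unbox = TRUE)
# {"hgvs_notations":"chr2:g.10216G>T"}
```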

How to proceed when redirected to page after successful sign in with POST method

I have signed in to a website using R 3.5.2, and this seems to have gone well both with rvest_0.3.4 and httr_1.4.0, but then I get stuck on a redirect page which, in the browser (Chrome), is shown only for a few seconds after I hit the "Login!" button.
The problematic step seems to be a form method="post" with input type="hidden" fields, which I can't manage to submit from R.
URL of the sign in CDP page
signin <- "https://www.cdp.net/en/users/sign_in"
rvest
library(rvest)
user.email <- "my_email"
user.password <- "my_password"
signin.session <- html_session(signin)
signin.form <- html_form(signin.session)[[1]]
filled.signin <- set_values(signin.form,
                            `user[email]` = user.email,
                            `user[password]` = user.password)
signed.in <- submit_form(signin.session, filled.signin)
read_html(signed.in) %>% html_node("form")
httr
library(httr)
login <- list(
  `user[email]` = "my_email",
  `user[password]` = "my_password",
  submit = "Login!")
signed.in.post <- POST(signin, body = login, encode = "form", verbose())
http_status(signed.in.post)
content(signed.in.post, as = "parsed")
read_html(signed.in.post$url) %>% html_node("form")
My goal is to access my account and browse the website, but I don't know how to get past the redirect page from R.
Updating the previous response by IvanP with the up-to-date rvest function names:
library(rvest)
signin.session <- session(signin)
signin.form <- html_form(signin.session)[[1]]
filled.signin <- html_form_set(signin.form,
                               `user[email]` = user.email,
                               `user[password]` = user.password)
signed.in <- session_submit(signin.session, filled.signin)
redirect.form <- html_form(signed.in)[[1]]
redirected <- session_submit(signed.in, redirect.form)
SOLVED!
It was quite an easy and intuitive solution: I just needed to submit the form method="post" input type="hidden" of the redirect page, i.e. the one encountered in the signed.in session.
I solved it with rvest, but I think httr would be equally easy. Here is the code I used:
library(rvest)
signin.session <- html_session(signin)
signin.form <- html_form(signin.session)[[1]]
filled.signin <- set_values(signin.form,
                            `user[email]` = user.email,
                            `user[password]` = user.password)
signed.in <- submit_form(signin.session, filled.signin)
redirect.form <- html_form(signed.in)[[1]]
redirected <- submit_form(signed.in, redirect.form)
This last object, redirected, is a session-class object: basically the page which can be browsed normally after signing in to the website.
In case someone has a shorter, more effective, more elegant/sexy/charming solution to proceed... please don't hesitate to share it.
I'm an absolute beginner at web scraping, and I am keen to learn more about these operations!
THX

Cannot connect to todoist REST API with R

I am not very good at working with APIs "from scratch", so to speak. My issue here probably has more to do with my ignorance of RESTful APIs than with the Todoist API specifically, but I'm struggling with Todoist because all of their documentation is geared around Python, and I'm not sure why my feeble attempts are failing. Once I get connected/authenticated I think I'll be fine.
Todoist documentation
I've tried a couple of configurations using httr::GET(). I would appreciate a little push here as I get started.
Things I've tried, where key is my API token:
library(httr)
r<-GET("https://beta.todoist.com/API/v8/", add_headers(hdr))
For hdr, I've used a variety of things:
hdr <- paste0("Authorization: Bearer", key)
just my key on its own
I also tried with projects at the end of the URL.
UPDATE These are now implemented in the R package rtodoist.
I think you nearly had it, except for the URL (or maybe it has changed since then) and the header. The following works for me, after replacing my_todoist_token with the API token found here.
library(jsonlite)
library(httr)
library(magrittr)  # for the %>% pipe

projects_api_url <- "https://api.todoist.com/rest/v1/projects"

# to get the projects as a data frame
header <- add_headers(Authorization = paste("Bearer", my_todoist_token))
project_df <- GET(url = projects_api_url, header) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = TRUE)

# to create a new project
# unfortunately no way to change the dot color associated with a project
header2 <- add_headers(
  Authorization = paste("Bearer", my_todoist_token),
  `Content-Type` = "application/json",
  `X-Request-Id` = uuid::UUIDgenerate())
POST(url = projects_api_url, header2,
     body = list(name = "Your New Project Name"
                 # parent = parentID
     ),
     encode = "json")

# get a project given its project id
GET(url = paste0(projects_api_url, "/", project_df$id[10]),
    header) %>%
  content("text", encoding = "UTF-8") %>%
  fromJSON(flatten = TRUE)

# update a project
POST(url = paste0(projects_api_url, "/", project_df$id[10]),
     header2, body = list(name = "IBS-AR Biometric 2019"), encode = "json")
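For completeness, deleting a project follows the same pattern (a sketch: the DELETE endpoint mirrors the GET/POST ones used above, per the Todoist REST v1 docs; header and project_df are the objects defined above):

```r
# delete a project given its project id; an empty 2xx response indicates success
DELETE(url = paste0(projects_api_url, "/", project_df$id[10]), header)
```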

R - form web scraping with rvest

First I'd like to take a moment and thank the SO community,
You helped me many times in the past without me needing to even create an account.
My current problem involves web scraping with R. Not my strong point.
I would like to scrape http://www.cbs.dtu.dk/services/SignalP/
what I have tried:
library(rvest)
url <- "http://www.cbs.dtu.dk/services/SignalP/"
seq <- "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM"
session <- rvest::html_session(url)
form <- rvest::html_form(session)[[2]]
form <- rvest::set_values(form, `SEQPASTE` = seq)
form_res_cbs <- rvest::submit_form(session, form)
# rvest prints out:
Submitting with 'trunc'
rvest::html_text(rvest::html_nodes(form_res_cbs, "head"))
# output:
"Configuration error"
rvest::html_text(rvest::html_nodes(form_res_cbs, "body"))
# output:
"Exception: WebfaceConfigError Package: Webface::service : 358 Message: Unhandled parameter 'NULL' in form"
I am unsure what the unhandled parameter is.
Is the problem in the submit button? I cannot seem to force:
form_res_cbs <- rvest::submit_form(session, form, submit = "submit")
# rvest prints out
Error: Unknown submission name 'submit'.
Possible values: trunc
Is the problem that submit$name is NULL?
form[["fields"]][[23]]
I tried defining the fake submit button as suggested here:
Submit form with no submit button in rvest
with no luck.
I am open to solutions using rvest or RCurl/httr; I would like to avoid using RSelenium.
EDIT: thanks to hrbrmstr's awesome answer I was able to build a function for this task. It is available in the package ragp: https://github.com/missuse/ragp
Well, this is doable. But it's going to require elbow grease.
This part:
library(rvest)
library(httr)
library(tidyverse)
POST(
  url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
  encode = "form",
  body = list(
    `configfile` = "/usr/opt/www/pub/CBS/services/SignalP-4.1/SignalP.cf",
    `SEQPASTE` = "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM",
    `orgtype` = "euk",
    `Dcut-type` = "default",
    `Dcut-noTM` = "0.45",
    `Dcut-TM` = "0.50",
    `graphmode` = "png",
    `format` = "summary",
    `minlen` = "",
    `method` = "best",
    `trunc` = ""
  ),
  verbose()
) -> res
Makes the request you made. I left verbose() in so you can watch what happens. It's missing the "filename" field, but you specified the string, so it's a good mimic of what you did.
Now, the tricky part is that it uses an intermediary redirect page that gives you a chance to enter an e-mail address for notification when the query is done. It does do a regular (every ~10s or so) check to see if the query is finished and will redirect quickly if so.
That page has the query id which can be extracted via:
content(res, as = "parsed") %>%
  html_nodes("input[name='jobid']") %>%
  html_attr("value") -> jobid
Now, we can mimic the final request, but I'd add in a Sys.sleep(20) before doing so to ensure the report is done.
GET(
  url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
  query = list(
    jobid = jobid,
    wait = "20"
  ),
  verbose()
) -> res2
That grabs the final results page:
htmltools::html_print(htmltools::HTML(content(res2, as = "text")))
You can see images are missing because GET only retrieves the HTML content. You can use functions from rvest/xml2 to parse through the page and scrape out the tables and the URLs that you can then use to get new content.
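If the fixed Sys.sleep(20) feels fragile, the wait can be replaced with a polling loop (a sketch: the "queued"/"active" marker text is an assumption about what the webface2 waiting page contains; inspect a real waiting page and adjust the pattern):

```r
library(httr)

repeat {
  res2 <- GET(
    url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
    query = list(jobid = jobid, wait = "20")
  )
  txt <- content(res2, as = "text")
  # Assumed marker: waiting pages mention the job being queued/active.
  if (!grepl("queued|active", txt, ignore.case = TRUE)) break
  Sys.sleep(10)  # be polite; don't hammer the server
}
```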
To do all this, I used burpsuite to intercept a browser session and then my burrp R package to inspect the results. You can also visually inspect in burpsuite and build things more manually.
