R - form web scraping with rvest - r

First I'd like to take a moment and thank the SO community,
You helped me many times in the past without me needing to even create an account.
My current problem involves web scraping with R, which is not my strong point.
I would like to scrape http://www.cbs.dtu.dk/services/SignalP/
What I have tried:
library(rvest)
url <- "http://www.cbs.dtu.dk/services/SignalP/"
seq <- "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM"
session <- rvest::html_session(url)
form <- rvest::html_form(session)[[2]]
form <- rvest::set_values(form, `SEQPASTE` = seq)
form_res_cbs <- rvest::submit_form(session, form)
#rvest prints out:
Submitting with 'trunc'
rvest::html_text(rvest::html_nodes(form_res_cbs, "head"))
#output:
"Configuration error"
rvest::html_text(rvest::html_nodes(form_res_cbs, "body"))
#output:
"Exception:WebfaceConfigErrorPackage:Webface::service : 358Message:Unhandled #parameter 'NULL' in form "
I am unsure what the unhandled parameter is.
Is the problem in the submit button? I cannot seem to force it:
form_res_cbs <- rvest::submit_form(session, form, submit = "submit")
#rvest prints out
Error: Unknown submission name 'submit'.
Possible values: trunc
Is the problem that submit$name is NULL?
form[["fields"]][[23]]
I tried defining the fake submit button as suggested here:
Submit form with no submit button in rvest
with no luck.
I am open to solutions using rvest or RCurl/httr; I would like to avoid using RSelenium.
EDIT: thanks to hrbrmstr's awesome answer I was able to build a function for this task. It is available in the package ragp: https://github.com/missuse/ragp

Well, this is doable. But it's going to require elbow grease.
This part:
library(rvest)
library(httr)
library(tidyverse)
POST(
url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
encode = "form",
body=list(
`configfile` = "/usr/opt/www/pub/CBS/services/SignalP-4.1/SignalP.cf",
`SEQPASTE` = "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM",
`orgtype` = "euk",
`Dcut-type` = "default",
`Dcut-noTM` = "0.45",
`Dcut-TM` = "0.50",
`graphmode` = "png",
`format` = "summary",
`minlen` = "",
`method` = "best",
`trunc` = ""
),
verbose()
) -> res
This makes the same request you made. I left verbose() in so you can watch what happens. It's missing the "filename" field, but you specified the sequence string directly, so it's a good mimic of what you did.
Now, the tricky part is that it uses an intermediary redirect page that gives you a chance to enter an e-mail address for notification when the query is done. It does do a regular (every ~10s or so) check to see if the query is finished and will redirect quickly if so.
That page has the query id which can be extracted via:
content(res, as="parsed") %>%
html_nodes("input[name='jobid']") %>%
html_attr("value") -> jobid
Now, we can mimic the final request, but I'd add in a Sys.sleep(20) before doing so to ensure the report is done.
GET(
url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
query = list(
jobid = jobid,
wait = "20"
),
verbose()
) -> res2
That grabs the final results page:
# html_print() and HTML() come from htmltools, which is not attached above
htmltools::html_print(htmltools::HTML(content(res2, as="text")))
You can see images are missing because GET only retrieves the HTML content. You can use functions from rvest/xml2 to parse through the page and scrape out the tables and the URLs that you can then use to get new content.
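As a rough sketch of that parsing step, the snippet below pulls a table and the image URLs out of a page and resolves the URLs against the site root. The HTML string is a made-up stand-in for the results page (its table contents and image path are assumptions, not real SignalP output), so the selectors would need adjusting against the real markup.

```r
library(rvest)
library(xml2)

# Made-up stand-in for the results page (not real SignalP output)
html <- '<html><body>
<table>
<tr><th>Name</th><th>Score</th></tr>
<tr><td>seq1</td><td>0.9</td></tr>
</table>
<img src="/tmp/plot.png">
</body></html>'

page <- read_html(html)

# First table on the page as a data frame
tbl <- html_table(html_node(page, "table"))

# Relative image paths, then resolved against the site root
img_src <- html_attr(html_nodes(page, "img"), "src")
img_url <- url_absolute(img_src, "http://www.cbs.dtu.dk/")

print(tbl)
print(img_url)
```

Each resolved URL could then be passed to a follow-up GET() or download.file() call to fetch the image itself.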
To do all this, I used burpsuite to intercept a browser session and then my burrp R package to inspect the results. You can also visually inspect in burpsuite and build things more manually.
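Instead of a fixed Sys.sleep(20), the wait could also be written as a small polling loop. This is only a sketch: poll_until, fetch and is_done are hypothetical names, and the stub below stands in for the real GET request so the example runs offline.

```r
# Call fetch() repeatedly until is_done() accepts its result.
# poll_until, fetch and is_done are hypothetical helper names.
poll_until <- function(fetch, is_done, interval = 10, max_tries = 12) {
  for (i in seq_len(max_tries)) {
    result <- fetch()
    if (is_done(result)) return(result)
    Sys.sleep(interval)
  }
  stop("Job did not finish after ", max_tries, " attempts")
}

# Offline stub: reports "running" twice, then "done"
attempts <- 0
fake_fetch <- function() {
  attempts <<- attempts + 1
  if (attempts < 3) "running" else "done"
}

result <- poll_until(fake_fetch, function(x) identical(x, "done"), interval = 0)
print(result)
```

With the real service, fetch would wrap the GET call above and is_done would check whether the returned page is the report rather than the waiting page; how to tell the two apart is something you would have to work out from the actual responses.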

Related

Using rvest or RSelenium to Scrape Table

The goal: Scrape the table from the following website using R.
The website: https://evanalytics.com/mlb/models/teams/advanced
What has me stuck:
I use rvest to automate most of my data gathering process, but this particular site seems to be out of rvest's scope of work (or at least beyond my level of experience). Unfortunately, it doesn't immediately load the table when the page opens. I have tried to come up with a solution via RSelenium but have been unsuccessful in finding the right path to the table (RSelenium is brand new to me). After navigating to the page and pausing for a brief period to allow the table to load, what's next?
What I have so far:
library("rvest")
library("RSelenium")
url <- "https://evanalytics.com/mlb/models/teams/advanced"
remDr <- remoteDriver(remoteServerAddr="192.168.99.100", port=4445L)
remDr$open()
remDr$navigate(url)
Sys.sleep(10)
Any help or guidance would be much appreciated. Thank you!
You can do this without Selenium by creating an html_session to pick up the required PHP session id, which is then passed along in cookies. You also need a User-Agent header. With the session in place you can make a POST XHR request to get all the data. You need a JSON parser to handle the JSON content within the response HTML.
You can see the params info in one of the script tags:
function executeEnteredQuery() {
var parameterArray = {
mode: 'runTime',
dataTable_id: 77
};
$.post('/admin/model/datatableQuery.php', {
parameter: window.btoa(jQuery.param(parameterArray))
},
function(res) {
processdataTableQueryResults(res);
}, "json");
}
You can encode the string yourself for params:
base64_enc('mode=runTime&dataTable_id=77')
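A small sketch of how that string could be built from the parameter list rather than typed by hand, using jsonlite's base64_enc and base64_dec. The params list mirrors parameterArray from the script tag; note that jQuery.param also URL-encodes values, which makes no difference for these two simple values.

```r
library(jsonlite)

# Same parameters as parameterArray in the page's script
params <- list(mode = "runTime", dataTable_id = 77)

# Build "key=value&key=value", then base64-encode it
query <- paste(names(params), unlist(params), sep = "=", collapse = "&")
encoded <- base64_enc(query)

# Decode again to check the round trip
decoded <- rawToChar(base64_dec(encoded))

print(encoded)
print(decoded)
```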
R:
require(httr)
require(rvest)
require(magrittr)
require(jsonlite)
headers = c('User-Agent' = 'Mozilla/5.0')
body = list('parameter' = 'bW9kZT1ydW5UaW1lJmRhdGFUYWJsZV9pZD03Nw==') # base64 encoded params for mode=runTime&dataTable_id=77
session <- html_session('https://evanalytics.com/mlb/models/teams/advanced', httr::add_headers(.headers=headers))
p <- session %>% rvest:::request_POST('https://evanalytics.com/admin/model/datatableQuery.php', body = body)%>%
read_html() %>%
html_node('p') %>%
html_text()
data <- jsonlite::fromJSON(p)
df <- data$dataRows$columns
print(df)
Py:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
body = {'parameter': 'bW9kZT1ydW5UaW1lJmRhdGFUYWJsZV9pZD03Nw=='} # base64 encoded params for mode=runTime&dataTable_id=77
with requests.Session() as s:
    # the initial GET picks up the session cookie
    r = s.get('https://evanalytics.com/mlb/models/teams/advanced')
    # pass the encoded parameters as the form body
    r = s.post('https://evanalytics.com/admin/model/datatableQuery.php', data=body)
    data = r.json()
cols = [th.text for th in bs(data['headerRow'], 'lxml').select('th')]
rows = [[td.text for td in bs(row['dataRow'], 'lxml').select('td')] for row in data['dataRows']]
df = pd.DataFrame(rows, columns = cols)
print(df)
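For comparison, here is a rough R sketch of the same parsing idea as the Python snippet: header and data rows arrive as HTML fragments inside JSON and are turned into a data frame. The JSON string is made up, and its shape (headerRow, dataRows, dataRow) is an assumption taken from the Python code, so the real response may differ. The cells helper is a hypothetical name.

```r
library(jsonlite)
library(rvest)

# Made-up response in the shape the Python snippet reads (an assumption)
json <- '{"headerRow": "<tr><th>Team</th><th>Wins</th></tr>",
          "dataRows": [{"dataRow": "<tr><td>A</td><td>10</td></tr>"},
                       {"dataRow": "<tr><td>B</td><td>7</td></tr>"}]}'
data <- fromJSON(json, simplifyVector = FALSE)

# Wrap each fragment in <table> so the parser keeps the row and cell tags
cells <- function(fragment, tag) {
  doc <- read_html(paste0("<table>", fragment, "</table>"))
  html_text(html_nodes(doc, tag))
}

cols <- cells(data$headerRow, "th")
rows <- lapply(data$dataRows, function(row) cells(row$dataRow, "td"))

df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(df) <- cols
print(df)
```

All cells come back as character; type conversion (for example with type.convert) would be a separate step.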
I'm a little short on time, so I'm just pointing you to the HTML source, from which you can extract the table with rvest.
remDr$navigate(url)
html <- remDr$getPageSource()[[1]]  # getPageSource() returns a list; the HTML string is its first element
## this gets you the HTML of the page; from here
## just extract the table as you would with rvest

How to proceed when redirected to page after successful sign in with POST method

I have signed in to a website using R 3.5.2, and this seems to have gone well with both rvest_0.3.4 and httr_1.4.0, but then I get stuck on a redirecting page which, in the browser (Chrome), is shown for only a few seconds after I hit the "Login!" button.
The problematic step seems to be a form with method="post" and input type="hidden", which I can't manage to submit from R.
URL of the sign in CDP page
signin <- "https://www.cdp.net/en/users/sign_in"
rvest
library(rvest)
user.email <- "my_email"
user.password <- "my_password"
signin.session <- html_session(signin)
signin.form <- html_form(signin.session)[[1]]
filled.signin <- set_values(signin.form,
`user[email]` = user.email,
`user[password]` = user.password)
signed.in <- submit_form(signin.session, filled.signin)
read_html(signed.in) %>% html_node("form")
httr
library(httr)
login <- list(
`user[email]` = "my_email",
`user[password]` = "my_password",
submit = "Login!")
signed.in.post <- POST(signin, body = login, encode = "form", verbose())
http_status(signed.in.post)
content(signed.in.post, as = "parsed")
read_html(signed.in.post$url) %>% html_node("form")
My goal is to access my account and browse the website, but I don't know how to go through the redirecting page from R.
Updating the previous response by IvanP with the up-to-date rvest function names:
library(rvest)
signin.session <- session(signin)
signin.form <- html_form(signin.session)[[1]]
filled.signin <- html_form_set(signin.form,
`user[email]` = user.email,
`user[password]` = user.password)
signed.in <- session_submit(signin.session, filled.signin)
redirect.form <- html_form(signed.in)[[1]]
redirected <- session_submit(signed.in, redirect.form)
SOLVED!
It was quite an easy and intuitive solution: I just needed to submit the form with method="post" and input type="hidden" on the redirecting page, i.e. the one encountered in the signed.in session.
I solved it with rvest, but I think httr would be equally easy. Here is the code I used:
library(rvest)
signin.session <- html_session(signin)
signin.form <- html_form(signin.session)[[1]]
filled.signin <- set_values(signin.form,
`user[email]` = user.email,
`user[password]` = user.password)
signed.in <- submit_form(signin.session, filled.signin)
redirect.form <- html_form(signed.in)[[1]]
redirected <- submit_form(signed.in, redirect.form)
This last object redirected is a session-class object, basically the page which can be normally browsed after signing in the website.
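To see what such an intermediate form carries before submitting it, the hidden fields can be listed with html_form. This is only a sketch on a made-up page: the action URL and the field names token and state are assumptions, not the real CDP markup.

```r
library(rvest)

# Made-up redirect page; action and field names are assumptions
redirect_page <- read_html('<html><body>
  <form action="/en/continue" method="post">
    <input type="hidden" name="token" value="abc123">
    <input type="hidden" name="state" value="xyz">
  </form>
</body></html>')

form <- html_form(redirect_page)[[1]]

# Keep only the hidden inputs and collect their values by name
hidden <- Filter(function(f) identical(f$type, "hidden"), form$fields)
payload <- vapply(hidden, function(f) f$value, character(1))

print(form$method)
print(payload)
```

These name/value pairs are what submit_form sends along when the redirect form is submitted from the session.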
In case someone has a shorter, more effective, more elegant/sexy/charming solution to proceed...please don't hesitate to share it.
I'm an absolute beginner of web-scraping, and I am keen to learn more about these operations!
THX

read_html() reads a different URL from my input

I'm using the rvest package in R. The read_html() function sometimes reads a different URL from my input URL. This happens when the input URL does not exist, so it automatically redirects to a similar one. Is there a way to stop this auto-redirect?
web <- read_html("http://www.thinkbabynames.com/meaning/0/AAGE")
The above URL does not exist, so it actually reads information on the page http://www.thinkbabynames.com/meaning/0/Ag
I only want the information on the exact page if it exists.
Thanks
You could perhaps think about it differently and make the POST request that searches for a given name. Then filter the returned content for meaning links using CSS attribute = value selectors, test the length of the filtered results, and if it is > 0 generate the final URL. There is then no redirect. Even if you don't want the meaning URL, it effectively does the same thing: the length is zero when the name is not found and > 0 when it is.
require(httr)
require(magrittr)
require(rvest)
name = 'Jemima'
base = 'http://www.thinkbabynames.com'
headers = c('User-Agent' = 'Mozilla/5.0')
body <- list('s' = name,'g' = '1' ,'q' = '0')
res <- httr::POST(url = 'http://www.thinkbabynames.com/query.php', httr::add_headers(.headers=headers), body = body)
results <- content(res) %>% html_nodes(paste('[href$="' , name, '"]','[href*=meaning]',sep='')) %>% html_attr(., "href")
if(length(results)>0){
results <- paste0(base, results)
}
print(results)
It seems like there should be a way to avoid the redirect and check whether the status code is 200 or 3xx with httr, but I'm not sure what it is. Regardless, you can check if the URL matches what it should:
get_html <- function(url){
req <- httr::GET(url)
if (req$url == url) httr::content(req) else NULL
}
get_html('http://www.thinkbabynames.com/meaning/0/AAGE')
#> NULL
get_html('http://www.thinkbabynames.com/meaning/0/Ag')
#> {xml_document}
#> <html xmlns="http://www.w3.org/1999/xhtml" lang="en">
#> [1] <head>\n<title>Ag - Name Meaning, What does Ag mean?</title>\n<meta ...
#> [2] <body>\r\n<header id="nav"><nav><table width="1200" cellpadding="0" ...
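One way to avoid following the redirect with httr is to pass the curl option followlocation through config() and inspect the status code directly. A minimal sketch, assuming the server answers a missing page with a 3xx redirect; fetch_exact and is_redirect are hypothetical helper names, and the function is only defined here, not called, so the block runs without network access.

```r
library(httr)

# TRUE for the 3xx range used for redirects
is_redirect <- function(code) code %in% 300:308

# Fetch a URL without following redirects; return parsed content
# only when the server answers 200 for exactly this URL.
fetch_exact <- function(url) {
  res <- GET(url, config(followlocation = 0L))
  code <- status_code(res)
  if (code == 200) {
    content(res)
  } else {
    if (is_redirect(code)) message("Redirected (", code, "), skipping: ", url)
    NULL
  }
}

# Example call (needs network access):
# fetch_exact("http://www.thinkbabynames.com/meaning/0/AAGE")
```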
Rvest can query pages as session objects that include the full server response, including the status code.
Use html_session (and jump_to or navigate_to for subsequent pages), instead of read_html.
You can view the status code of the session, and if it is successful (i.e., not a redirect), then you can use read_html to get the content from the session object.
x <- html_session("http://www.thinkbabynames.com/meaning/0/Ag")
statusCode <- x$response$status_code
if(statusCode == 200){
# do this if response is successful
web <- read_html(x)
}
if(statusCode %in% 300:308){
# do this if response is a redirect
}
Status codes can be quite useful in figuring out what to do with your data; see here for more detail: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

rvest package doesn't recognize the form

I wanted to scrape some data from following website:
http://predstecajnenagodbe.fina.hr/pn-public-web/predmet/search
but when I tried to use rvest:
library(rvest)
session <- html_session("http://predstecajnenagodbe.fina.hr/pn-public-web/predmet/search")
form <- html_form(session)
form
it doesn't find the form, even if it is there (as you can see on the page).
I have also tried with POST function from httr package:
parameters <- list(since = "1.6.2018", until = "5.6.2018", `g-recaptcha-response` = "03AF6jDqXcBw1qmbrxWqadGqh9k8eHAzB9iPbYdnwzhEVSgCwO0Mi6DQDgckigpeMH1ikV70egOC0UppZsO7tO9hgdpEIaI04jTpG6JxGMR6wov27kEkLuVsEp1LhxZB4WFDRkDWdqcZeVN1YkiojUpje4k-swFG7tPyG2pJN86SdT290D9_0fyfrxlpfFNL2VUwE_c15vVthcBEdXIQ68V5qv7ZVooLiwrdTO2qLDLF1yUZWiu9IJoLuBWdFzJ_zdSP6fbuj5wTpfPdsYJ2n988Gcb3q2aYdn-2TVuWoQzqs1wbh7ya_Geo7_8gnDUL92l2nqTeV9CMY58fzppPPYDJcchdHFTTxadGwCGZyKC3WUSh81qiGZ5JhNDUpPnOO-MgSr5aPbA7tei7bbypHV9OOVjPGLLtqA9g")
httr::POST(
url,
body = parameters,
config = list(
add_headers(Referer = "http://predstecajnenagodbe.fina.hr"),
user_agent(get_header()),
accept_encoding = get_encoding(),
use_proxy("xxxx", port = 80,
username = "xxx", password = "xxxx"),
timeout(20L),
tcp_keepalive = FALSE
),
encode = "form",
verbose()
)
but it returns some JS code and message:
Please enable JavaScript to view the page content.Your support ID is:
10544975822212666004
Could you please explain why rvest doesn't recognize the form and why POST doesn't work either?

Submit form with no submit button in rvest

I'm trying to write a crawler to download some information, similar to this Stack Overflow post. The answer is useful for creating the filled-in form, but I'm struggling to find a way to submit the form when a submit button is not part of the form. Here is an example:
session <- html_session("www.chase.com")
form <- html_form(session)[[3]]
filledform <- set_values(form, `user_name` = user_name, `usr_password` = usr_password)
session <- submit_form(session, filledform)
At this point, I receive this error:
Error in names(submits)[[1]] : subscript out of bounds
How can I make this form submit?
Here's a dirty hack that works for me: After studying the submit_form source code, I figured that I could work around the problem by injecting a fake submit button into my code version of the form, and then the submit_form function would call that. It works, except that it gives a warning that often lists an inappropriate input object (not in the example below, though). However, despite the warning, the code works for me:
session <- html_session("www.chase.com")
form <- html_form(session)[[3]]
# Form on home page has no submit button,
# so inject a fake submit button or else rvest cannot submit it.
# When I do this, rvest gives a warning "Submitting with '___'", where "___" is
# often an irrelevant field item.
# This warning might be an rvest (version 0.3.2) bug, but the code works.
fake_submit_button <- list(name = NULL,
type = "submit",
value = NULL,
checked = NULL,
disabled = NULL,
readonly = NULL,
required = FALSE)
attr(fake_submit_button, "class") <- "input"
form[["fields"]][["submit"]] <- fake_submit_button
user_name <- "user"
usr_password <- "password"
filledform <- set_values(form, `user_name` = user_name, `usr_password` = usr_password)
session <- submit_form(session, filledform)
The successful result displays the following warning, which I simply ignore:
> Submitting with 'submit'

Resources