R programming: GET function for HTTP with query parameters

I'm using R and the httr package to make an HTTP request.
I want to call the GET function with query parameters.
GET(url = NULL, config = list(), ..., handle = NULL)
The request contains two parts, separated by a question mark (?):
1- URL: https://example.com
2- URL parameter: 'title='
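# Function to get links to a specific page
page_link <- function() {
  # no trailing "?" needed: httr adds the separator when query is supplied
  url <- "https://example.com"
  q1 <- list(title = "")
  # use a separate name so the result does not shadow the function itself
  response <- GET(url, query = q1)
  return(response)
}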

If you're asking how to build the URL yourself so that a single parameter is sent with GET, you could try paste0(), e.g. (x being the index of the parameter in q1):
url <- paste0("https://example.com?", names(q1)[x], "=", q1[[x]])
page_link <- GET(url)
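To see what the query argument actually produces, here is a minimal sketch that builds the URL without sending a request. The value "R" for title is a made-up placeholder, and httr::modify_url() is used here only as a way to inspect the result; as far as I know, GET() composes its URL through the same parse/build machinery.

```r
library(httr)

base_url <- "https://example.com"
q1 <- list(title = "R")  # hypothetical value, for illustration only

# Build the URL that a call like GET(base_url, query = q1) would request,
# without making a network call.
built <- modify_url(base_url, query = q1)
print(built)

# Round-trip check: parsing the URL recovers the named parameter.
parsed <- parse_url(built)
print(parsed$query$title)
```

Passing a named list this way also takes care of URL-escaping the values, which paste0() does not do.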

Related

Using rvest or RSelenium to Scrape Table

The goal: Scrape the table from the following website using R.
The website: https://evanalytics.com/mlb/models/teams/advanced
What has me stuck:
I use rvest to automate most of my data gathering process, but this particular site seems to be out of rvest's scope of work (or at least beyond my level of experience). Unfortunately, it doesn't immediately load the table when the page opens. I have tried to come up with a solution via RSelenium but have been unsuccessful in finding the right path to the table (RSelenium is brand new to me). After navigating to the page and pausing for a brief period to allow the table to load, what's next?
What I have so far:
library("rvest")
library("RSelenium")
url <- "https://evanalytics.com/mlb/models/teams/advanced"
remDr <- remoteDriver(remoteServerAddr="192.168.99.100", port=4445L)
remDr$open()
remDr$navigate(url)
Sys.sleep(10)
Any help or guidance would be much appreciated. Thank you!
You can do this without Selenium by creating an html_session, which picks up the required PHP session id and passes it along in cookies. You additionally need a User-Agent header. With the session in place you can then make a POST XHR request to get all the data. You need a JSON parser to handle the JSON content within the response HTML.
You can see the params info in one of the script tags:
function executeEnteredQuery() {
var parameterArray = {
mode: 'runTime',
dataTable_id: 77
};
$.post('/admin/model/datatableQuery.php', {
parameter: window.btoa(jQuery.param(parameterArray))
},
function(res) {
processdataTableQueryResults(res);
}, "json");
}
You can encode the string yourself for params:
base64_enc('mode=runTime&dataTable_id=77')
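As a quick sanity check, the encoding can be reproduced and reversed locally. This sketch assumes base64_enc() and base64_dec() from the jsonlite package; no request is sent.

```r
library(jsonlite)

params <- "mode=runTime&dataTable_id=77"

# Encode the parameter string, mirroring the page's window.btoa() call.
encoded <- base64_enc(params)
print(encoded)

# Decode again to confirm the round trip.
decoded <- rawToChar(base64_dec(encoded))
print(decoded)
```

If the site ever changes dataTable_id, only the params string needs to change.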
R:
require(httr)
require(rvest)
require(magrittr)
require(jsonlite)
headers = c('User-Agent' = 'Mozilla/5.0')
body = list('parameter' = 'bW9kZT1ydW5UaW1lJmRhdGFUYWJsZV9pZD03Nw==') # base64 encoded params for mode=runTime&dataTable_id=77
session <- html_session('https://evanalytics.com/mlb/models/teams/advanced', httr::add_headers(.headers=headers))
p <- session %>% rvest:::request_POST('https://evanalytics.com/admin/model/datatableQuery.php', body = body)%>%
read_html() %>%
html_node('p') %>%
html_text()
data <- jsonlite::fromJSON(p)
df <- data$dataRows$columns
print(df)
Py:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
body = {'parameter': 'bW9kZT1ydW5UaW1lJmRhdGFUYWJsZV9pZD03Nw=='} # base64 encoded params for mode=runTime&dataTable_id=77
with requests.Session() as s:
    # the site expects a User-Agent header
    s.headers.update({'User-Agent': 'Mozilla/5.0'})
    # first request picks up the PHP session cookie
    r = s.get('https://evanalytics.com/mlb/models/teams/advanced')
    # send the encoded parameters as form data
    r = s.post('https://evanalytics.com/admin/model/datatableQuery.php', data=body)
    data = r.json()
    cols = [th.text for th in bs(data['headerRow'], 'lxml').select('th')]
    rows = [[td.text for td in bs(row['dataRow'], 'lxml').select('td')] for row in data['dataRows']]
    df = pd.DataFrame(rows, columns=cols)
    print(df)
I'm a little short on time, so I'm just pointing you to the HTML source, from which you could extract the table with rvest.
remDr$navigate(url)
html <- remDr$getPageSource()[[1]]
## this gets you the HTML of the page as a string; from here
## just extract the table as you would with rvest
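The extraction step might look roughly like this. The HTML string below is a made-up stand-in for the page source returned by RSelenium, so the column names and values are purely illustrative.

```r
library(rvest)

# Stand-in for remDr$getPageSource()[[1]]; the table contents are made up.
page_source <- paste0(
  "<html><body><table>",
  "<tr><th>Team</th><th>Wins</th></tr>",
  "<tr><td>A</td><td>10</td></tr>",
  "<tr><td>B</td><td>7</td></tr>",
  "</table></body></html>"
)

# Parse the string and pull out every <table> on the page.
tables <- html_table(read_html(page_source))
df <- tables[[1]]
print(df)
```

On the real page you would likely need to pick the right element of tables, or select the table node by id or class first.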

add optional query parameters into R POST request

I want to make a POST request to https://rest.ensembl.org. Currently the following works:
server <- "https://rest.ensembl.org"
ext <- "/vep/human/hgvs"
r <- POST(paste(server, ext, sep = ""),
content_type("application/json"),
accept("application/json"),
body = '{ "hgvs_notations" : ["chr2:g.10216G>T"] }')
which results in this URL: https://rest.ensembl.org/vep/human/hgvs/chr2:g.10216G>T. I would like to use the ? parameter to modify my URL to https://rest.ensembl.org/vep/human/hgvs/chr2:g.10216G>T?CADD=1; however, I can't see how to do this with the POST function in R.
Any help would be great!
If it is always the same parameter you need to send, why not just include it in the URI then?
You could do something like POST( paste0(server, ext, '?CADD=1'), [...] ).
Or would that not be dynamic enough for your use case?
The following would be the less hacky way to include parameters:
library(httr)
library(jsonlite)
r <- POST(
paste0(server, ext),
query = list('CADD' = 1),
content_type_json(),
accept_json(),
body = toJSON(list('hgvs_notations' = c('chr2:g.10216G>T')))
)
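One detail worth checking when replacing the hand-written body with toJSON() is how a length-one vector is serialised, since the original body sends hgvs_notations as an array. A small sketch of the two behaviours (no request is sent):

```r
library(jsonlite)

payload <- list(hgvs_notations = c("chr2:g.10216G>T"))

# Default: vectors become JSON arrays, even with a single element.
body_array <- toJSON(payload)
print(body_array)

# With auto_unbox = TRUE the single element becomes a bare string instead,
# which no longer matches the array used in the hand-written body.
body_scalar <- toJSON(payload, auto_unbox = TRUE)
print(body_scalar)
```

So the default settings keep the same shape as the body in the question.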

read_html() reads a different URL from my input

I'm using the rvest package in R. The read_html() function sometimes reads a different URL from my input URL. This happens when the input URL does not exist, so the site automatically redirects to a similar one. Is there a way to stop this auto-redirect?
web <- read_html("http://www.thinkbabynames.com/meaning/0/AAGE")
The above URL does not exist, so it actually reads information on the page http://www.thinkbabynames.com/meaning/0/Ag
I only want the information on the exact page if it exists.
Thanks
You could perhaps think about it differently and do the POST request that searches for a given name. Then filter the returned content for meaning links using CSS attribute = value selectors. Then test the length of the filtered results and, if it is > 0, generate the final URL. There is then no redirect. Even if you don't want the meaning URL, it effectively does the same thing: the length will be zero when the name is not found and > 0 when it is found.
require(httr)
require(magrittr)
require(rvest)
name = 'Jemima'
base = 'http://www.thinkbabynames.com'
headers = c('User-Agent' = 'Mozilla/5.0')
body <- list('s' = name,'g' = '1' ,'q' = '0')
res <- httr::POST(url = 'http://www.thinkbabynames.com/query.php', httr::add_headers(.headers=headers), body = body)
results <- content(res) %>% html_nodes(paste('[href$="' , name, '"]','[href*=meaning]',sep='')) %>% html_attr(., "href")
if(length(results)>0){
results <- paste0(base, results)
}
print(results)
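To illustrate how the combined attribute selector behaves, here is a small offline sketch. The HTML fragment is invented for the example and does not reflect the site's real markup.

```r
library(rvest)

name <- "Jemima"

# Made-up search results; only the first link satisfies both conditions.
html <- read_html(paste0(
  "<ul>",
  "<li><a href='/meaning/0/Jemima'>Jemima</a></li>",
  "<li><a href='/names/list/Jemima'>other page</a></li>",
  "<li><a href='/meaning/0/Jem'>Jem</a></li>",
  "</ul>"
))

# href must end with the name AND contain "meaning".
selector <- paste0('[href$="', name, '"]', '[href*=meaning]')
hits <- html_attr(html_nodes(html, selector), "href")
print(hits)

# A name with no matching link yields a zero-length result.
none <- html_attr(html_nodes(html, '[href$="AAGE"][href*=meaning]'), "href")
print(length(none))
```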
It seems like there should be a way to avoid the redirect and check whether the status code is 200 or 3xx with httr, but I'm not sure what it is. Regardless, you can check if the URL matches what it should:
get_html <- function(url){
req <- httr::GET(url)
if (req$url == url) httr::content(req) else NULL
}
get_html('http://www.thinkbabynames.com/meaning/0/AAGE')
#> NULL
get_html('http://www.thinkbabynames.com/meaning/0/Ag')
#> {xml_document}
#> <html xmlns="http://www.w3.org/1999/xhtml" lang="en">
#> [1] <head>\n<title>Ag - Name Meaning, What does Ag mean?</title>\n<meta ...
#> [2] <body>\r\n<header id="nav"><nav><table width="1200" cellpadding="0" ...
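One possibility for avoiding the redirect altogether, offered as a sketch rather than something tested against this site, is to switch off redirect following through the curl option followlocation and inspect the status code. The helper names is_redirect() and get_without_redirect() are hypothetical.

```r
library(httr)

# Hypothetical helper: TRUE for any 3xx status code.
is_redirect <- function(status) {
  status >= 300 & status < 400
}

# Hypothetical helper: fetch a URL without following redirects and
# return the parsed content only when the server answers 200 directly.
get_without_redirect <- function(url) {
  resp <- GET(url, config(followlocation = 0L))
  code <- status_code(resp)
  if (code == 200) {
    content(resp)
  } else {
    if (is_redirect(code)) message("Redirected, status ", code)
    NULL
  }
}
```

Calling get_without_redirect() on the non-existent URL should then return NULL, assuming the server signals the redirect with a 3xx status rather than via JavaScript or a meta refresh.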
Rvest can query pages as session objects that include the full server response, including the status code.
Use html_session (and jump_to or navigate_to for subsequent pages), instead of read_html.
You can view the status code of the session, and if it is successful (i.e., not a redirect), then you can use read_html to get the content from the session object.
x <- html_session("http://www.thinkbabynames.com/meaning/0/Ag")
statusCode <- x$response$status_code
if(statusCode == 200){
# do this if response is successful
web <- read_html(x)
}
if(statusCode %in% 300:308){
# do this if response is a redirect
}
Status codes can be quite useful in figuring out what to do with your data; see here for more detail: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

rvest package doesn't recognize the form

I wanted to scrape some data from the following website:
http://predstecajnenagodbe.fina.hr/pn-public-web/predmet/search
but when I tried to use rvest:
library(rvest)
session <- html_session("http://predstecajnenagodbe.fina.hr/pn-public-web/predmet/search")
form <- html_form(session)
form
it doesn't find the form, even if it is there (as you can see on the page).
I have also tried with POST function from httr package:
parameters <- list(since = "1.6.2018", until = "5.6.2018", `g-recaptcha-response` = "03AF6jDqXcBw1qmbrxWqadGqh9k8eHAzB9iPbYdnwzhEVSgCwO0Mi6DQDgckigpeMH1ikV70egOC0UppZsO7tO9hgdpEIaI04jTpG6JxGMR6wov27kEkLuVsEp1LhxZB4WFDRkDWdqcZeVN1YkiojUpje4k-swFG7tPyG2pJN86SdT290D9_0fyfrxlpfFNL2VUwE_c15vVthcBEdXIQ68V5qv7ZVooLiwrdTO2qLDLF1yUZWiu9IJoLuBWdFzJ_zdSP6fbuj5wTpfPdsYJ2n988Gcb3q2aYdn-2TVuWoQzqs1wbh7ya_Geo7_8gnDUL92l2nqTeV9CMY58fzppPPYDJcchdHFTTxadGwCGZyKC3WUSh81qiGZ5JhNDUpPnOO-MgSr5aPbA7tei7bbypHV9OOVjPGLLtqA9g")
httr::POST(
url,
body = parameters,
config = list(
add_headers(Referer = "http://predstecajnenagodbe.fina.hr"),
user_agent(get_header()),
accept_encoding = get_encoding(),
use_proxy("xxxx", port = 80,
username = "xxx", password = "xxxx"),
timeout(20L),
tcp_keepalive = FALSE
),
encode = "form",
verbose()
)
but it returns some JS code and message:
Please enable JavaScript to view the page content.Your support ID is:
10544975822212666004
Could you please explain why rvest doesn't recognize the form and why POST doesn't work either?

R - form web scraping with rvest

First I'd like to take a moment and thank the SO community,
You helped me many times in the past without me needing to even create an account.
My current problem involves web scraping with R. Not my strong point.
I would like to scrape http://www.cbs.dtu.dk/services/SignalP/
What I have tried:
library(rvest)
url <- "http://www.cbs.dtu.dk/services/SignalP/"
seq <- "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM"
session <- rvest::html_session(url)
form <- rvest::html_form(session)[[2]]
form <- rvest::set_values(form, `SEQPASTE` = seq)
form_res_cbs <- rvest::submit_form(session, form)
#rvest prints out:
Submitting with 'trunc'
rvest::html_text(rvest::html_nodes(form_res_cbs, "head"))
#output:
"Configuration error"
rvest::html_text(rvest::html_nodes(form_res_cbs, "body"))
#output:
"Exception:WebfaceConfigErrorPackage:Webface::service : 358Message:Unhandled #parameter 'NULL' in form "
I am unsure what is the unhandled parameter.
Is the problem in the submit button? I can not seem to force:
form_res_cbs <- rvest::submit_form(session, form, submit = "submit")
#rvest prints out
Error: Unknown submission name 'submit'.
Possible values: trunc
Is the problem that submit$name is NULL?
form[["fields"]][[23]]
I tried defining the fake submit button as suggested here:
Submit form with no submit button in rvest
with no luck.
I am open to solutions using rvest or RCurl/httr, I would like to avoid using RSelenium
EDIT: thanks to hrbrmstr's awesome answer, I was able to build a function for this task. It is available in the package ragp: https://github.com/missuse/ragp
Well, this is doable. But it's going to require elbow grease.
This part:
library(rvest)
library(httr)
library(tidyverse)
POST(
url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
encode = "form",
body=list(
`configfile` = "/usr/opt/www/pub/CBS/services/SignalP-4.1/SignalP.cf",
`SEQPASTE` = "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM",
`orgtype` = "euk",
`Dcut-type` = "default",
`Dcut-noTM` = "0.45",
`Dcut-TM` = "0.50",
`graphmode` = "png",
`format` = "summary",
`minlen` = "",
`method` = "best",
`trunc` = ""
),
verbose()
) -> res
Makes the request you made. I left verbose() in so you can watch what happens. It's missing the "filename" field, but you specified the string, so it's a good mimic of what you did.
Now, the tricky part is that it uses an intermediary redirect page that gives you a chance to enter an e-mail address for notification when the query is done. It does do a regular (every ~10s or so) check to see if the query is finished and will redirect quickly if so.
That page has the query id which can be extracted via:
content(res, as="parsed") %>%
html_nodes("input[name='jobid']") %>%
html_attr("value") -> jobid
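For reference, the same extraction can be tried offline on a stand-in page. The form markup and the job id value below are invented for the example; the real intermediate page will differ.

```r
library(rvest)

# Made-up stand-in for the intermediate page; the real job id will differ.
wait_page <- read_html(paste0(
  "<form action='/cgi-bin/webface2.fcgi'>",
  "<input type='hidden' name='jobid' value='EXAMPLE123'>",
  "<input type='text' name='email'>",
  "</form>"
))

# Select the input by its name attribute and read its value.
jobid <- html_attr(html_nodes(wait_page, "input[name='jobid']"), "value")
print(jobid)
```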
Now, we can mimic the final request, but I'd add in a Sys.sleep(20) before doing so to ensure the report is done.
GET(
url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
query = list(
jobid = jobid,
wait = "20"
),
verbose()
) -> res2
That grabs the final results page:
htmltools::html_print(htmltools::HTML(content(res2, as="text")))
You can see images are missing because GET only retrieves the HTML content. You can use functions from rvest/xml2 to parse through the page and scrape out the tables and the URLs that you can then use to get new content.
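Since the intermediate page itself checks every ~10s, the fixed Sys.sleep(20) before the final GET could be swapped for a small polling loop. The helper poll_until() below is hypothetical and generic; what check() should look for on the results page depends on the site's HTML and is left out here.

```r
# Hypothetical helper: call check() until it returns TRUE or we give up.
poll_until <- function(check, interval = 10, max_tries = 6) {
  for (i in seq_len(max_tries)) {
    if (isTRUE(check())) return(TRUE)
    # no need to wait after the last attempt
    if (i < max_tries) Sys.sleep(interval)
  }
  FALSE
}

# Usage with a dummy check that succeeds on the third call.
calls <- 0
ready <- poll_until(function() {
  calls <<- calls + 1
  calls >= 3
}, interval = 0)
print(ready)
```

In practice check() would issue the GET with the jobid and return TRUE once the response no longer looks like the waiting page.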
To do all this, I used burpsuite to intercept a browser session and then my burrp R package to inspect the results. You can also visually inspect in burpsuite and build things more manually.
