read_html() reads a different URL from my input - r

I'm using the rvest package in R. Sometimes read_html() reads a different URL from the one I pass in. This happens when the input URL does not exist, so the site automatically redirects to a similar one. Is there a way to stop this auto-redirect?
web <- read_html("http://www.thinkbabynames.com/meaning/0/AAGE")
The above URL does not exist, so it actually reads information on the page http://www.thinkbabynames.com/meaning/0/Ag
I only want the information on the exact page if it exists.
Thanks

You could perhaps think about it differently and make the POST request that searches for a given name. Then filter the returned content for meaning links using CSS attribute = value selectors, and test the length of the filtered results: if it is > 0, generate the final URL. There is then no redirect. Even if you don't want the meaning URL, this effectively does the same check: the length will be zero when the name is not found and > 0 when it is found.
require(httr)
require(magrittr)
require(rvest)
name = 'Jemima'
base = 'http://www.thinkbabynames.com'
headers = c('User-Agent' = 'Mozilla/5.0')
body <- list('s' = name,'g' = '1' ,'q' = '0')
res <- httr::POST(url = 'http://www.thinkbabynames.com/query.php', httr::add_headers(.headers=headers), body = body)
results <- content(res) %>%
  html_nodes(paste('[href$="', name, '"]', '[href*=meaning]', sep = '')) %>%
  html_attr("href")
if (length(results) > 0) {
  results <- paste0(base, results)
}
print(results)

It seems like there should be a way to avoid the redirect and check whether the status code is 200 or 3xx with httr, but I'm not sure what it is. Regardless, you can check if the URL matches what it should:
get_html <- function(url) {
  req <- httr::GET(url)
  if (req$url == url) httr::content(req) else NULL
}
get_html('http://www.thinkbabynames.com/meaning/0/AAGE')
#> NULL
get_html('http://www.thinkbabynames.com/meaning/0/Ag')
#> {xml_document}
#> <html xmlns="http://www.w3.org/1999/xhtml" lang="en">
#> [1] <head>\n<title>Ag - Name Meaning, What does Ag mean?</title>\n<meta ...
#> [2] <body>\r\n<header id="nav"><nav><table width="1200" cellpadding="0" ...
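The redirect itself can probably be avoided by asking curl not to follow it. The following is a minimal sketch, assuming the httr package is installed; `get_exact()` is a hypothetical helper name, and `followlocation` is the libcurl option passed through `httr::config()`. It has not been run against this site.

```r
# Hypothetical helper: fetch a page without following redirects.
get_exact <- function(url) {
  # followlocation = 0L asks libcurl to return a 3xx response as-is
  res <- httr::GET(url, httr::config(followlocation = 0L))
  code <- httr::status_code(res)
  if (code == 200L) {
    # The exact URL answered directly, so parse its content
    httr::content(res)
  } else {
    # For a redirect, the target is normally in the Location header
    message("Status ", code, "; Location: ", httr::headers(res)$location)
    NULL
  }
}
```

With this approach, a call such as get_exact('http://www.thinkbabynames.com/meaning/0/AAGE') would be expected to return NULL if the server answers with a redirect status rather than 200.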

rvest can query pages as session objects that include the full server response, including the status code.
Use html_session() (and jump_to() or navigate_to() for subsequent pages) instead of read_html().
You can then check the session's status code and, if the request was successful (i.e., not a redirect), use read_html() to get the content from the session object.
x <- html_session("http://www.thinkbabynames.com/meaning/0/Ag")
statusCode <- x$response$status_code
if (statusCode == 200) {
  # do this if response is successful
  web <- read_html(x)
}
if (statusCode %in% 300:308) {
  # do this if response is a redirect
}
Status codes can be quite useful in figuring out what to do with your data; see here for more detail: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
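As a small illustration of branching on status codes, here is a sketch in base R; `status_class()` is a hypothetical helper, not part of rvest or httr. Note that a session which has already followed a redirect may report the final page's 200 rather than the 3xx code, so this check is most useful when redirects are not followed automatically.

```r
# Hypothetical helper: map an HTTP status code to its class.
status_class <- function(code) {
  switch(as.character(code %/% 100),
    "2" = "success",
    "3" = "redirect",
    "4" = "client error",
    "5" = "server error",
    "other"  # anything outside 2xx-5xx
  )
}

status_class(200)  # "success"
status_class(301)  # "redirect"
status_class(404)  # "client error"
```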

Related

Log into website using R, rvest session - not able to set unnamed fields

Edit: updated with real, working login credentials, and with a few efforts we've tried, to no success
library(dplyr)
library(rvest)
library(xml2)
login_to_website <- function() {
# Login, create a web session with the desired login address
login_url <- "https://www.sportsbettingdime.com/plus/register/"
pgsession <- session(login_url); Sys.sleep(1)
# grab & submit 1st form on page
pgform <- html_form(pgsession)[[1]];
filled_form <- html_form_set(pgform, email="nickcanova#gmail.com", password="tpe.vxp3ZRT!twq.wkd"); # not working
session_submit(pgsession, filled_form); Sys.sleep(1)
# return session to use in other functions
return(pgsession)
}
We are looking to log into this website; however, the fields in the pgform that is returned (pgform$fields) are unnamed. There is no name field in the form when you inspect the website either...
> pgform
<form> '<unnamed>' (GET )
<field> (text) :
<field> (password) :
<field> (button) :
>
As a result, the html_form_set call doesn't work, throwing the error Error: Can't set value of fields that don't exist: ' email ', ' password '.
EDIT: It looks like this works for setting the fields (replace html_form_set with the 4 lines below); however, the result still does not leave us logged in...
pgform$fields[1][[1]]$value <- 'nickcanova#gmail.com' # add email
pgform$fields[2][[1]]$value <- 'tpe.vxp3ZRT!twq.wkd' # add password
pgform$method = "POST" # change from GET to POST
pgform$action <- login_url # was NULL, cant be NULL, set to login_url? (not sure here)
We set the method to POST and we set the action to the url solely because, in another function that logs into a different website, the method is POST and the action is the URL of the page? Not sure if these are correct though...
How to check if login is successful - in the code below, the columns for money and sharp are omitted when not logged in, and available when logged in. Currently we are getting all NA values in these last 2 columns, the goal is to get actual numbers in these 2 fields:
page_url <- paste0('https://www.sportsbettingdime.com/', league, '/odds/')
page_overall <- pgsession %>% session_jump_to(page_url)
table_list <- page_overall %>% html_nodes('table.odds-table') %>% html_table()
df1 <- table_list[1] %>% as.data.frame()
colnames(df1) = c('teams', 'spread1', 'spread2', 'spread3', 'spread4', 'spread5', 'bet', 'money', 'sharp', 'bet2')
df1[, c('money', 'sharp')]

R programming: GET function for HTTP with query parameters

I'm using R and the httr package to make an HTTP request.
I want to call the GET function with query parameters.
GET(url = NULL, config = list(), ..., handle = NULL)
The request consists of two parts, separated by a question mark (?):
1- URL: https://example.com
2- URL parameter: 'title='
# Function to get the link to a specific page
page_link <- function() {
  url <- "https://example.com?"
  q1 <- list(title = "")
  page_link <- GET(url, query = q1)
  return(page_link)
}
If you're asking how to combine the URL and a query element into a single string to request with GET, you could try paste0(), e.g.:
url <- paste0("https://example.com?title=", q1$title)
page_link <- GET(url)
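Alternatively, httr can build the query string from a named list itself, which avoids pasting strings by hand. A minimal sketch, assuming httr is installed; `build_page_url()` is a hypothetical helper and "R" is only an illustrative value for title:

```r
# Hypothetical helper: build a request URL from a base URL and a named list.
build_page_url <- function(base_url, params) {
  httr::modify_url(base_url, query = params)
}

page_url <- build_page_url("https://example.com", list(title = "R"))
print(page_url)

# The same list can also be passed straight to GET(), which appends it
# to the URL as the query string:
# res <- httr::GET("https://example.com", query = list(title = "R"))
```

Passing the list via `query` also lets httr take care of escaping parameter values.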

How to check if an url object is reachable or not using try catch in R

I have the following URL objects and need to check whether they are reachable before downloading and processing the CSV files. I can't hard-code the URLs, as they keep changing based on previous steps.
My requirement is: read the link if it is reachable; otherwise throw an error and go on to the next link.
url1= "https://s3.mydata.csv"
url2="https://s4.mydata.csv"
url3="https://s5.mydata.csv"
(Below code will be repeated for the other 2 URLs as well)
readUrl <- function(url) {
  out <- tryCatch(
    {
      # Reachability check: errors if the URL cannot be opened
      readLines(con = url, warn = FALSE)
      # Only reached when the check above succeeded
      data.table::fread(url, sep = ",", header = TRUE, verbose = TRUE,
                        fill = TRUE, skip = 2)
    },
    error = function(cond) {
      message(cond)
      return(NA)
    }
  )
  return(out)
}
urls <- c(url1, url2, url3)
y <- lapply(urls, readUrl)
Why not use the function url.exists directly from the RCurl package?
From documentation:
This functions is analogous to file.exists and determines whether a
request for a specific URL responds without error.
Using the boolean result of this function, you can easily adapt your starting code without tryCatch.
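One way this could look is sketched below. `read_if_reachable()` is a hypothetical helper; it defaults to `RCurl::url.exists` and `data.table::fread`, but both are passed in as arguments so the logic can be tried offline with stand-in functions, as in the demonstration.

```r
# Hypothetical helper: read a CSV only when its URL responds.
# exists_fn and read_fn are injectable so the logic can be tried offline.
read_if_reachable <- function(url,
                              exists_fn = RCurl::url.exists,
                              read_fn = data.table::fread) {
  if (!isTRUE(exists_fn(url))) {
    message("Skipping unreachable URL: ", url)
    return(NULL)
  }
  read_fn(url)
}

# Offline demonstration with stand-in functions (no network access):
fake_exists <- function(u) grepl("s3", u, fixed = TRUE)
fake_read <- function(u) data.frame(source = u)
urls <- c("https://s3.mydata.csv", "https://s4.mydata.csv")
results <- lapply(urls, read_if_reachable,
                  exists_fn = fake_exists, read_fn = fake_read)
```

With the defaults left in place, `lapply(urls, read_if_reachable)` would skip unreachable links with a message and move on to the next one.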

Using rvest or RSelenium to Scrape Table

The goal: Scrape the table from the following website using R.
The website: https://evanalytics.com/mlb/models/teams/advanced
What has me stuck:
I use rvest to automate most of my data gathering process, but this particular site seems to be out of rvest's scope of work (or at least beyond my level of experience). Unfortunately, it doesn't immediately load the table when the page opens. I have tried to come up with a solution via RSelenium but have been unsuccessful in finding the right path to the table (RSelenium is brand new to me). After navigating to the page and pausing for a brief period to allow the table to load, what's next?
What I have so far:
library("rvest")
library("RSelenium")
url <- "https://evanalytics.com/mlb/models/teams/advanced"
remDr <- remoteDriver(remoteServerAddr="192.168.99.100", port=4445L)
remDr$open()
remDr$navigate(url)
Sys.sleep(10)
Any help or guidance would be much appreciated. Thank you!
You can do this without Selenium by creating an html_session to pick up the required PHP session ID, which is passed in cookies. You additionally need a User-Agent header. With the session in place, you can then make a POST XHR request to get all the data. You need a JSON parser to handle the JSON content within the response HTML.
You can see the params info in one of the script tags:
function executeEnteredQuery() {
    var parameterArray = {
        mode: 'runTime',
        dataTable_id: 77
    };
    $.post('/admin/model/datatableQuery.php', {
            parameter: window.btoa(jQuery.param(parameterArray))
        },
        function(res) {
            processdataTableQueryResults(res);
        }, "json");
}
You can encode the string yourself for params:
base64_enc('mode=runTime&dataTable_id=77')
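To make the encoding step concrete, here is a sketch that builds the parameter string from a named list and encodes it with jsonlite. It assumes the values need no URL escaping (which jQuery.param would otherwise apply); the variable names are illustrative.

```r
# Build the parameter string from a named list, then base64-encode it.
params <- list(mode = "runTime", dataTable_id = 77)
param_string <- paste(names(params), unlist(params), sep = "=", collapse = "&")
encoded <- jsonlite::base64_enc(param_string)
print(encoded)

# Decoding recovers the original string, which is a quick sanity check
decoded <- rawToChar(jsonlite::base64_dec(encoded))
```

The encoded value should match the string used as the `parameter` field in the request body.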
R:
require(httr)
require(rvest)
require(magrittr)
require(jsonlite)
headers = c('User-Agent' = 'Mozilla/5.0')
body = list('parameter' = 'bW9kZT1ydW5UaW1lJmRhdGFUYWJsZV9pZD03Nw==') # base64 encoded params for mode=runTime&dataTable_id=77
session <- html_session('https://evanalytics.com/mlb/models/teams/advanced', httr::add_headers(.headers=headers))
p <- session %>%
  rvest:::request_POST('https://evanalytics.com/admin/model/datatableQuery.php', body = body) %>%
  read_html() %>%
  html_node('p') %>%
  html_text()
data <- jsonlite::fromJSON(p)
df <- data$dataRows$columns
print(df)
Py:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

body = {'parameter': 'bW9kZT1ydW5UaW1lJmRhdGFUYWJsZV9pZD03Nw=='}  # base64 encoded params for mode=runTime&dataTable_id=77

with requests.Session() as s:
    # First request picks up the session cookie
    r = s.get('https://evanalytics.com/mlb/models/teams/advanced')
    # Send the encoded parameters as form data
    r = s.post('https://evanalytics.com/admin/model/datatableQuery.php', data=body)
    data = r.json()
    cols = [th.text for th in bs(data['headerRow'], 'lxml').select('th')]
    rows = [[td.text for td in bs(row['dataRow'], 'lxml').select('td')] for row in data['dataRows']]
    df = pd.DataFrame(rows, columns=cols)
    print(df)
I'm a little short on time, so I'm just pointing you to how to get the page's HTML source, from which you can extract the table with rvest.
remDr$navigate(url)
html <- remDr$getPageSource()[[1]]
## this gets you the HTML of the page as a string; from here,
## just extract the table as you would with rvest
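The extraction step might look like the sketch below. The inline HTML string is a made-up stand-in for the first element of what getPageSource() returns (it returns a list), so the team names and numbers are purely illustrative; the real page's table will differ.

```r
# Stand-in for remDr$getPageSource()[[1]]; real page source will differ.
page_source <- "<html><body><table>
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td>A</td><td>10</td></tr>
  <tr><td>B</td><td>7</td></tr>
</table></body></html>"

# Parse the HTML string and convert every <table> into a data frame
doc <- rvest::read_html(page_source)
tables <- rvest::html_table(rvest::html_nodes(doc, "table"))
df <- as.data.frame(tables[[1]])
print(df)
```

If the page holds several tables, `tables` has one element per table, and a more specific CSS selector can be used to pick the right one.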

R - form web scraping with rvest

First I'd like to take a moment and thank the SO community,
You helped me many times in the past without me needing to even create an account.
My current problem involves web scraping with R. Not my strong point.
I would like to scrape http://www.cbs.dtu.dk/services/SignalP/
What I have tried:
library(rvest)
url <- "http://www.cbs.dtu.dk/services/SignalP/"
seq <- "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM"
session <- rvest::html_session(url)
form <- rvest::html_form(session)[[2]]
form <- rvest::set_values(form, `SEQPASTE` = seq)
form_res_cbs <- rvest::submit_form(session, form)
#rvest prints out:
Submitting with 'trunc'
rvest::html_text(rvest::html_nodes(form_res_cbs, "head"))
#output:
"Configuration error"
rvest::html_text(rvest::html_nodes(form_res_cbs, "body"))
#output:
"Exception:WebfaceConfigErrorPackage:Webface::service : 358Message:Unhandled #parameter 'NULL' in form "
I am unsure what is the unhandled parameter.
Is the problem in the submit button? I can not seem to force:
form_res_cbs <- rvest::submit_form(session, form, submit = "submit")
#rvest prints out
Error: Unknown submission name 'submit'.
Possible values: trunc
Is the problem that submit$name is NULL?
form[["fields"]][[23]]
I tried defining the fake submit button as suggested here:
Submit form with no submit button in rvest
with no luck.
I am open to solutions using rvest or RCurl/httr, I would like to avoid using RSelenium
EDIT: thanks to hrbrmstr's awesome answer, I was able to build a function for this task. It is available in the package ragp: https://github.com/missuse/ragp
Well, this is doable. But it's going to require elbow grease.
This part:
library(rvest)
library(httr)
library(tidyverse)
POST(
  url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
  encode = "form",
  body = list(
    `configfile` = "/usr/opt/www/pub/CBS/services/SignalP-4.1/SignalP.cf",
    `SEQPASTE` = "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM",
    `orgtype` = "euk",
    `Dcut-type` = "default",
    `Dcut-noTM` = "0.45",
    `Dcut-TM` = "0.50",
    `graphmode` = "png",
    `format` = "summary",
    `minlen` = "",
    `method` = "best",
    `trunc` = ""
  ),
  verbose()
) -> res
Makes the request you made. I left verbose() in so you can watch what happens. It's missing the "filename" field, but you specified the string, so it's a good mimic of what you did.
Now, the tricky part is that it uses an intermediary redirect page that gives you a chance to enter an e-mail address for notification when the query is done. It does do a regular (every ~10s or so) check to see if the query is finished and will redirect quickly if so.
That page has the query id which can be extracted via:
content(res, as="parsed") %>%
  html_nodes("input[name='jobid']") %>%
  html_attr("value") -> jobid
Now, we can mimic the final request, but I'd add in a Sys.sleep(20) before doing so to ensure the report is done.
GET(
  url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
  query = list(
    jobid = jobid,
    wait = "20"
  ),
  verbose()
) -> res2
That grabs the final results page:
htmltools::html_print(htmltools::HTML(content(res2, as="text")))
You can see images are missing because GET only retrieves the HTML content. You can use functions from rvest/xml2 to parse through the page and scrape out the tables and the URLs that you can then use to get new content.
To do all this, I used burpsuite to intercept a browser session and then my burrp R package to inspect the results. You can also visually inspect in burpsuite and build things more manually.
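Instead of a fixed Sys.sleep(20), the final request could be retried until the report is ready. The sketch below shows one generic way to do that in base R; `poll_until()` is a hypothetical helper, and how to recognise a finished report on this particular site is an assumption left to the `done` function.

```r
# Hypothetical helper: call fetch() until done() accepts the result.
poll_until <- function(fetch, done, interval = 10, max_tries = 6) {
  for (i in seq_len(max_tries)) {
    result <- fetch()
    if (isTRUE(done(result))) return(result)
    # Wait before the next attempt, but not after the last one
    if (i < max_tries) Sys.sleep(interval)
  }
  stop("Result not ready after ", max_tries, " attempts")
}

# Offline demonstration: a counter stands in for the GET request.
attempts <- 0
fake_fetch <- function() {
  attempts <<- attempts + 1
  attempts
}
ready <- poll_until(fake_fetch, function(x) x >= 3, interval = 0)
```

In practice `fetch` could wrap the GET call above and `done` could look for some marker of a finished report in the returned page; both would need to be worked out by inspecting the actual responses.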

Resources