The goal: Scrape the table from the following website using R.
The website: https://evanalytics.com/mlb/models/teams/advanced
What has me stuck:
I use rvest to automate most of my data gathering process, but this particular site seems to be out of rvest's scope of work (or at least beyond my level of experience). Unfortunately, it doesn't immediately load the table when the page opens. I have tried to come up with a solution via RSelenium but have been unsuccessful in finding the right path to the table (RSelenium is brand new to me). After navigating to the page and pausing for a brief period to allow the table to load, what's next?
What I have so far:
library("rvest")
library("RSelenium")
url <- "https://evanalytics.com/mlb/models/teams/advanced"
remDr <- remoteDriver(remoteServerAddr="192.168.99.100", port=4445L)
remDr$open()
remDr$navigate(url)
Sys.sleep(10)
Any help or guidance would be much appreciated. Thank you!
You can do this without Selenium by creating an html_session to pick up the required PHP session id, which is then passed along in cookies. You also need a User-Agent header. With the session in place, you can make a POST XHR request to get all the data. You need a JSON parser to handle the JSON content within the response HTML.
You can see the params info in one of the script tags:
function executeEnteredQuery() {
  var parameterArray = {
    mode: 'runTime',
    dataTable_id: 77
  };
  $.post('/admin/model/datatableQuery.php', {
      parameter: window.btoa(jQuery.param(parameterArray))
    },
    function(res) {
      processdataTableQueryResults(res);
    }, "json");
}
You can encode the string yourself for params:
base64_enc('mode=runTime&dataTable_id=77')
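As a minimal sketch of what the page's script does before posting, the parameter string can be rebuilt and encoded in R. This assumes jsonlite's base64_enc() is the encoder meant above and that the values need no URL-escaping; params and query_string are illustrative names, not part of the site's code.

```r
library(jsonlite)

# Parameters copied from the page's script tag
params <- list(mode = "runTime", dataTable_id = 77)

# Mimic jQuery.param(): join name=value pairs with "&"
# (assumes the values contain no characters that need URL-escaping)
query_string <- paste0(names(params), "=", unlist(params), collapse = "&")

# Mimic window.btoa(): base64-encode the query string
encoded <- base64_enc(query_string)
print(encoded)

# Decoding the result should give the original string back
decoded <- rawToChar(base64_dec(encoded))
print(decoded)
```

Building the string this way makes it easy to try a different dataTable_id without encoding by hand.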
R:
require(httr)
require(rvest)
require(magrittr)
require(jsonlite)
headers = c('User-Agent' = 'Mozilla/5.0')
body = list('parameter' = 'bW9kZT1ydW5UaW1lJmRhdGFUYWJsZV9pZD03Nw==') # base64 encoded params for mode=runTime&dataTable_id=77
session <- html_session('https://evanalytics.com/mlb/models/teams/advanced', httr::add_headers(.headers=headers))
p <- session %>% rvest:::request_POST('https://evanalytics.com/admin/model/datatableQuery.php', body = body) %>%
  read_html() %>%
  html_node('p') %>%
  html_text()
data <- jsonlite::fromJSON(p)
df <- data$dataRows$columns
print(df)
Py:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
headers = {'User-Agent': 'Mozilla/5.0'}
body = {'parameter': 'bW9kZT1ydW5UaW1lJmRhdGFUYWJsZV9pZD03Nw=='} # base64 encoded params for mode=runTime&dataTable_id=77
with requests.Session() as s:
    s.headers.update(headers)
    # the initial GET picks up the PHP session cookie
    r = s.get('https://evanalytics.com/mlb/models/teams/advanced')
    # send the encoded params as form data
    r = s.post('https://evanalytics.com/admin/model/datatableQuery.php', data=body)
    data = r.json()
    cols = [th.text for th in bs(data['headerRow'], 'lxml').select('th')]
    rows = [[td.text for td in bs(row['dataRow'], 'lxml').select('td')] for row in data['dataRows']]
    df = pd.DataFrame(rows, columns = cols)
    print(df)
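For R users, the same header and row parsing can be sketched as follows. The payload here is a made-up stand-in shaped like the keys the Python snippet reads (headerRow, and dataRow inside each element of dataRows); the real response may be structured differently, and cells() is a hypothetical helper.

```r
library(jsonlite)
library(xml2)

# Hypothetical payload; only the key names are taken from the Python snippet
payload <- '{
  "headerRow": "<tr><th>Team</th><th>Wins</th></tr>",
  "dataRows": [
    {"dataRow": "<tr><td>AAA</td><td>10</td></tr>"},
    {"dataRow": "<tr><td>BBB</td><td>7</td></tr>"}
  ]
}'
data <- fromJSON(payload, simplifyVector = FALSE)

# Extract the text of every <th> or <td> cell from an HTML fragment;
# the fragment is wrapped in <table> so the parser sees valid markup
cells <- function(fragment, tag) {
  doc <- read_html(paste0("<table>", fragment, "</table>"))
  xml_text(xml_find_all(doc, paste0(".//", tag)))
}

cols <- cells(data$headerRow, "th")
rows <- lapply(data$dataRows, function(r) cells(r$dataRow, "td"))

# Bind the row vectors into a character data frame and name the columns
df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(df) <- cols
print(df)
```

All columns come back as character, so numeric columns would still need converting, for example with type.convert().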
I'm a little short on time, so I'm just pointing you to the HTML source of the page, from which you can extract the table with rvest.
remDr$navigate(url)
html <- remDr$getPageSource()[[1]]
## this gets you the HTML of the page; from here,
## just extract the table as you would with rvest
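A minimal sketch of that last step, assuming the rendered page contains an ordinary HTML table. The page_source string below is a stand-in for the first element of remDr$getPageSource(); the real markup and the right selector will differ.

```r
library(rvest)

# Stand-in for remDr$getPageSource()[[1]]; the real page source will differ
page_source <- "<html><body><table>
  <tr><th>Team</th><th>Wins</th></tr>
  <tr><td>AAA</td><td>10</td></tr>
  <tr><td>BBB</td><td>7</td></tr>
</table></body></html>"

# Parse the source, pick the first <table>, and convert it to a data frame
tbl <- read_html(page_source) %>%
  html_node("table") %>%
  html_table()
print(tbl)
```

If the page holds several tables, html_nodes("table") returns all of them and the wanted one can be picked by position.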
I'm using R and the httr package to make an HTTP request.
I want to call GET with query parameters.
GET(url = NULL, config = list(), ..., handle = NULL)
The request consists of two parts, separated by a question mark (?):
1- the URL: https://example.com
2- the URL parameter: 'title='
# Function to get links to a specific page
page_link <- function(){
  url <- "https://example.com?"
  q1 <- list(title = "")
  page_link <- GET(url, query = q1)
  return(page_link)
}
If you're asking how to bind the url to get one element to request with GET, you should try paste0(), e.g.:
url <- paste0("https://example.com?", q1[x])
page_link <- GET(url)
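As an alternative sketch, httr can assemble the query string itself from a named list, so no manual pasting is needed. The value "example" is a hypothetical placeholder, since the question leaves title empty.

```r
library(httr)

# Hypothetical parameter value; the question leaves 'title' empty
q1 <- list(title = "example")

# modify_url() shows the URL the query list turns into, without sending a request
full_url <- modify_url("https://example.com", query = q1)
print(full_url)

# The same list can be passed straight to GET(), which builds the URL itself:
# page_link <- GET("https://example.com", query = q1)
```

Letting httr build the URL also takes care of escaping characters such as spaces in the parameter values.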
I'm using the rvest package in R. The read_html() function sometimes reads a different URL from my input URL. This happens when the input URL does not exist, so it automatically redirects to a similar one. Is there a way to stop this auto-redirect?
web <- read_html("http://www.thinkbabynames.com/meaning/0/AAGE")
The above URL does not exist, so it actually reads information on the page http://www.thinkbabynames.com/meaning/0/Ag
I only want the information on the exact page if it exists.
Thanks
You could perhaps think about it differently and make the POST request that searches for a given name. Then filter the returned content for meaning links using CSS attribute = value selectors, and test the length of the filtered results: if it is > 0, generate the final URL. There is then no redirect. Even if you don't want the meaning URL, this effectively does the same thing: the length is zero when the name is not found and > 0 when it is found.
require(httr)
require(magrittr)
require(rvest)
name = 'Jemima'
base = 'http://www.thinkbabynames.com'
headers = c('User-Agent' = 'Mozilla/5.0')
body <- list('s' = name,'g' = '1' ,'q' = '0')
res <- httr::POST(url = 'http://www.thinkbabynames.com/query.php', httr::add_headers(.headers=headers), body = body)
results <- content(res) %>% html_nodes(paste('[href$="' , name, '"]','[href*=meaning]',sep='')) %>% html_attr(., "href")
if (length(results) > 0) {
  results <- paste0(base, results)
}
print(results)
It seems like there should be a way to avoid the redirect and check whether the status code is 200 or 3xx with httr, but I'm not sure what it is. Regardless, you can check if the URL matches what it should:
get_html <- function(url){
  req <- httr::GET(url)
  if (req$url == url) httr::content(req) else NULL
}
get_html('http://www.thinkbabynames.com/meaning/0/AAGE')
#> NULL
get_html('http://www.thinkbabynames.com/meaning/0/Ag')
#> {xml_document}
#> <html xmlns="http://www.w3.org/1999/xhtml" lang="en">
#> [1] <head>\n<title>Ag - Name Meaning, What does Ag mean?</title>\n<meta ...
#> [2] <body>\r\n<header id="nav"><nav><table width="1200" cellpadding="0" ...
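On the point about avoiding the redirect altogether, one possible sketch is to pass curl's followlocation option through httr::config(), so that a 3xx status is reported instead of being followed. The function name get_exact_html is hypothetical, and this has not been checked against the site in question.

```r
library(httr)

# Return the parsed page only when the server answers 200 for this exact URL.
# followlocation = 0L asks curl not to follow redirects, so a 3xx status
# is reported as-is instead of being replaced by the final page.
get_exact_html <- function(url) {
  resp <- GET(url, config(followlocation = 0L))
  if (status_code(resp) == 200L) content(resp) else NULL
}

# Usage (needs network access):
# get_exact_html("http://www.thinkbabynames.com/meaning/0/AAGE")
```

Compared with matching the final URL, this avoids downloading the page that the redirect points to.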
The rvest package can query pages as session objects that include the full server response, including the status code.
Use html_session (and jump_to or navigate_to for subsequent pages), instead of read_html.
You can view the status code of the session, and if it is successful (i.e., not a redirect), then you can use read_html to get the content from the session object.
x <- html_session("http://www.thinkbabynames.com/meaning/0/Ag")
statusCode <- x$response$status_code
if (statusCode == 200) {
  # do this if response is successful
  web <- read_html(x)
}
if (statusCode %in% 300:308) {
  # do this if response is a redirect
}
Status codes can be quite useful in figuring out what to do with your data; see here for more detail: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
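As a small illustration of branching on status codes, the helper below groups a code by its first digit. classify_status is a hypothetical name and the labels are only for illustration.

```r
# Group an HTTP status code by its class (first digit)
classify_status <- function(code) {
  switch(as.character(code %/% 100),
    "2" = "success",
    "3" = "redirect",
    "4" = "client error",
    "5" = "server error",
    "other"
  )
}

print(classify_status(200))
print(classify_status(302))
print(classify_status(404))
```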
Apologies in advance if this is very basic, but I'm lost on this!
I want to scrape the following table in R:
http://dgsp.cns.gob.mx/Transparencia/wConsultasGeneral.aspx
However, this page is written in, I believe, Java. I tried with RSelenium, but I am not having any success in scraping the 17 pages of this table.
Could you give me a hint about how to scrape the entire content of this table?
Given it's just 17 pages, I would manually click through the pages and save the HTML source. It would take no more than 3-5 minutes this way.
However, if you want to do it programmatically, we can start by writing a function that takes a page number, finds the link for that page, clicks on the link, and returns the HTML source for that page:
get_html <- function(i) {
  webElem <- remDr$findElement(using = "link text", as.character(i))
  webElem$clickElement()
  Sys.sleep(s)
  remDr$getPageSource()[[1]]
}
Initialize some values:
s <- 2 # seconds to wait between each page
total_pages <- 17
html_pages <- vector("list", total_pages)
Start the browser, navigate to page 1, and save the source:
library(RSelenium)
rD <- rsDriver()
remDr <- rD[["client"]]
base_url <- "http://dgsp.cns.gob.mx/Transparencia/wConsultasGeneral.aspx"
remDr$navigate(base_url)
src <- remDr$getPageSource()[[1]]
html_pages[1] <- src
For pages 2 to 17, we use a for-loop and call the function we wrote above, taking care to account specially for page 11:
for (i in 2:total_pages) {
  if (i == 11) {
    webElem <- remDr$findElement(using = "link text", "...")
    webElem$clickElement()
    Sys.sleep(s)
    html_pages[i] <- remDr$getPageSource()[[1]]
  } else {
    html_pages[i] <- get_html(i)
  }
}
remDr$close()
The result is html_pages, a list of length 17, with each element the HTML source for each page. How to parse the data from HTML into some other form (e.g. a dataframe) is probably a separate question by itself.
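As a rough sketch of that parsing step, assuming each page source contains one table with the same columns, the pages can be parsed one by one with rvest and stacked. The two strings below are stand-ins for elements of html_pages, and parse_page is a hypothetical helper; the real page will need its own selector.

```r
library(rvest)

# Stand-ins for two elements of html_pages; the real sources will differ
html_pages <- list(
  "<table><tr><th>id</th><th>name</th></tr><tr><td>1</td><td>a</td></tr></table>",
  "<table><tr><th>id</th><th>name</th></tr><tr><td>2</td><td>b</td></tr><tr><td>3</td><td>c</td></tr></table>"
)

# Parse the first <table> of one page into a data frame
parse_page <- function(src) {
  as.data.frame(html_table(html_node(read_html(src), "table")))
}

# Stack the per-page tables into one data frame
all_rows <- do.call(rbind, lapply(html_pages, parse_page))
print(all_rows)
```

rbind() requires identical column names on every page, so a page with a different layout would need handling separately.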
First I'd like to take a moment and thank the SO community,
You helped me many times in the past without me needing to even create an account.
My current problem involves web scraping with R, which is not my strong point.
I would like to scrape http://www.cbs.dtu.dk/services/SignalP/
What I have tried:
library(rvest)
url <- "http://www.cbs.dtu.dk/services/SignalP/"
seq <- "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM"
session <- rvest::html_session(url)
form <- rvest::html_form(session)[[2]]
form <- rvest::set_values(form, `SEQPASTE` = seq)
form_res_cbs <- rvest::submit_form(session, form)
#rvest prints out:
Submitting with 'trunc'
rvest::html_text(rvest::html_nodes(form_res_cbs, "head"))
#output:
"Configuration error"
rvest::html_text(rvest::html_nodes(form_res_cbs, "body"))
#output:
"Exception:WebfaceConfigErrorPackage:Webface::service : 358Message:Unhandled #parameter 'NULL' in form "
I am unsure what the unhandled parameter is.
Is the problem in the submit button? I cannot seem to force it:
form_res_cbs <- rvest::submit_form(session, form, submit = "submit")
#rvest prints out
Error: Unknown submission name 'submit'.
Possible values: trunc
Is the problem that submit$name is NULL?
form[["fields"]][[23]]
I tried defining the fake submit button as suggested here:
Submit form with no submit button in rvest
with no luck.
I am open to solutions using rvest or RCurl/httr; I would like to avoid using RSelenium.
EDIT: thanks to hrbrmstr's awesome answer, I was able to build a function for this task. It is available in the package ragp: https://github.com/missuse/ragp
Well, this is doable. But it's going to require elbow grease.
This part:
library(rvest)
library(httr)
library(tidyverse)
POST(
  url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
  encode = "form",
  body = list(
    `configfile` = "/usr/opt/www/pub/CBS/services/SignalP-4.1/SignalP.cf",
    `SEQPASTE` = "MTSKTCLVFFFSSLILTNFALAQDRAPHGLAYETPVAFSPSAFDFFHTQPENPDPTFNPCSESGCSPLPVAAKVQGASAKAQESDIVSISTGTRSGIEEHGVVGIIFGLAFAVMM",
    `orgtype` = "euk",
    `Dcut-type` = "default",
    `Dcut-noTM` = "0.45",
    `Dcut-TM` = "0.50",
    `graphmode` = "png",
    `format` = "summary",
    `minlen` = "",
    `method` = "best",
    `trunc` = ""
  ),
  verbose()
) -> res
Makes the request you made. I left verbose() in so you can watch what happens. It's missing the "filename" field, but you specified the string, so it's a good mimic of what you did.
Now, the tricky part is that it uses an intermediary redirect page that gives you a chance to enter an e-mail address for notification when the query is done. It does do a regular (every ~10s or so) check to see if the query is finished and will redirect quickly if so.
That page has the query id which can be extracted via:
content(res, as="parsed") %>%
  html_nodes("input[name='jobid']") %>%
  html_attr("value") -> jobid
Now, we can mimic the final request, but I'd add in a Sys.sleep(20) before doing so to ensure the report is done.
GET(
  url = "http://www.cbs.dtu.dk/cgi-bin/webface2.fcgi",
  query = list(
    jobid = jobid,
    wait = "20"
  ),
  verbose()
) -> res2
That grabs the final results page:
# html_print() and HTML() come from the htmltools package
htmltools::html_print(htmltools::HTML(content(res2, as="text")))
You can see images are missing because GET only retrieves the HTML content. You can use functions from rvest/xml2 to parse through the page and scrape out the tables and the URLs that you can then use to get new content.
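Instead of a single fixed Sys.sleep(20), the wait could also be written as a small polling loop. This is only a sketch: poll_until is a hypothetical helper, and the check that decides whether the SignalP results page is ready is not given in this answer, so a toy check stands in for it.

```r
# Call check() until it returns TRUE, waiting `interval` seconds between tries.
# Returns TRUE if the condition was met, FALSE if max_tries ran out.
poll_until <- function(check, interval = 10, max_tries = 12) {
  for (i in seq_len(max_tries)) {
    if (isTRUE(check())) return(TRUE)
    Sys.sleep(interval)
  }
  FALSE
}

# Toy check that succeeds on the third call; with the real service, check()
# would re-request the job page and test whether the results are present
# (that test is an assumption and would have to be worked out).
calls <- 0
ready <- poll_until(function() { calls <<- calls + 1; calls >= 3 }, interval = 0)
print(ready)
```

Capping the number of tries keeps the script from waiting forever if the job never finishes.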
To do all this, I used burpsuite to intercept a browser session and then my burrp R package to inspect the results. You can also visually inspect in burpsuite and build things more manually.
I'm using this amazing package (googlesheets) to read and upload data with my Shiny app. It works OK, but when I add a row to the sheet, the new row does not keep the server's encoding and does not behave like the data in the previous rows. Spanish names I entered manually are fine, but when I use the app to load data, special Latin characters (UTF-8) are replaced in the sheet.
That data is then not recognized by the app in subsequent sessions.
library(googlesheets)
table <- "Reportes"
saveData <- function(data) {
  # Grab the Google Sheet
  sheet <- gs_title(table)
  # Add the data as a new row
  gs_add_row(sheet, input = data)
}
loadData <- function() {
  # Grab the Google Sheet
  sheet <- gs_title(table)
  # Read the data
  gs_read_csv(sheet)
}
Then, I use a button in the UI, and an observer in the SERVER to load the data...
observeEvent(input$enviar, {
  exit <- input$enviar
  if (exit==1){
    addData <- c(as.character(input$fecha),
                 as.character(input$local),
                 as.character(input$dpto),
                 as.character(input$estado),
                 as.character(input$fsiembra),
                 as.character(input$ref),
                 as.character(loc$lat[loc$Departamento==input$dpto & loc$Localidad==input$local]),
                 as.character(loc$long[loc$Departamento==input$dpto & loc$Localidad==input$local]),
                 as.character(getZafra(input$fecha)))
    saveData(addData)
    d <- loadData()
    reset('fecha')
    reset('dpto')
    reset('local')
    reset('estado')
    reset('fsiembra')
    reset('ref')
    reset('pass')
    disable('enviar')
  }
})
Please... if anyone can help I'd be very happy.
I discovered that I needed to declare the encoding of the character vector before uploading.
I used:
Encoding(addData) = "latin1"
saveData(addData)
and it worked just fine!
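To illustrate what declaring the encoding does, here is a small base R sketch. It builds a latin1 string from raw bytes, marks it with Encoding(), and converts it to UTF-8 with enc2utf8(). The sample word is made up, and whether an explicit UTF-8 conversion is needed for the sheet is an assumption, not something tested here.

```r
# Bytes for "Peña" in latin1 (0xf1 is ñ); built from raw values so the
# example does not depend on the encoding of this script
x <- rawToChar(as.raw(c(0x50, 0x65, 0xf1, 0x61)))

# Declare how the bytes should be interpreted (this does not change them)
Encoding(x) <- "latin1"

# Convert to UTF-8, which re-encodes ñ as the two bytes c3 b1
x_utf8 <- enc2utf8(x)
print(Encoding(x_utf8))
print(charToRaw(x_utf8))
```

Encoding() only labels the existing bytes, while enc2utf8() actually rewrites them, which is why the byte sequence changes in the last step.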