Error when attempting to submit a form with rvest

Adapting this SO answer, I'm trying to use rvest to fill in and submit a form so I can scrape the resulting page, but I keep running into an error.
library(rvest)
url <- "https://iemweb.biz.uiowa.edu/pricehistory/pricehistory_SelectContract.cfm?market_ID=214"
pg.session <- html_session(url)
pg.form <- html_form(html(pg.session))
filled_form <- set_values(pg.form[[1]],
                          Month = "8",
                          Year = "1")
out <- submit_form(session = pg.session, pg.form)
This returns the error:
Submitting with ''
Error in if (!(submit %in% names(submits))) { :
argument is of length zero
What am I doing wrong?

Well, for one thing, you are not submitting the form you actually filled in, and you are passing in a list of forms rather than a single form. But it also appears there may be a bug in the code: it doesn't recognize submit buttons with upper-case type attributes. In this case, the HTML contains
<INPUT TYPE="SUBMIT" VALUE="Get Prices">
and submit_form calls submit_request, which looks for submit buttons via
submits <- Filter(function(x) identical(x$type, "submit"),
                  form$fields)
Since it checks for values identical to "submit", it doesn't find "SUBMIT":
sapply(pg.form[[1]]$fields, function(x) x$type)
# $Market_ID
# [1] "HIDDEN"
# $Month
# NULL
# $Year
# NULL
# $`NULL`
# [1] "SUBMIT"
The easiest thing might be to change it ourselves:
filled_form <- set_values(pg.form[[1]],
                          Month = "08",
                          Year = "2007")
filled_form$fields[[4]]$type <- "submit"
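Alternatively, rather than hard-coding the field index, we could lower-case every field type so the stock filter finds the submit button regardless of case. A minimal sketch, assuming the form structure shown above:
filled_form$fields <- lapply(filled_form$fields, function(x) {
  # normalize e.g. "SUBMIT" / "HIDDEN" to "submit" / "hidden"
  if (!is.null(x$type)) x$type <- tolower(x$type)
  x
})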
The other problem is that this version has a bug in the way the URL for the form is resolved. We can fix it with:
# incorrectly was: url <- XML::getRelativeURL(session$url, form$url)
body(submit_form)[[3]]<-quote(url <- XML::getRelativeURL(form$url, session$url))
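We can confirm the patch took effect by printing the modified line of the function body:
body(submit_form)[[3]]
## url <- XML::getRelativeURL(form$url, session$url)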
And now finally we can submit the request
out <- submit_form(session = pg.session, filled_form)
# out %>% html_table()
(Tested with rvest_0.2.0.9000)

Related

RSelenium the web page kept on loading after clicking the next button

I am new to web scraping and want to scrape data from https://www.forwardpathway.com/us-college-database. I used the following code to extract the data from the table, but the page just kept on loading after I clicked the next button. Can anybody point out what is wrong?
library(RSelenium)
library(tidyverse)
library(netstat)
library(xml2)
library(data.table)
library(rvest)
binman::list_versions("chromedriver")
rs_driver_object <- rsDriver(browser = "chrome",
                             chromever = "107.0.5304.62",
                             verbose = F,
                             port = free_port())
## create the client
remDr<-rs_driver_object$client
## open the browser
remDr$open()
remDr$navigate("https://www.forwardpathway.com/us-college-database")
## locate the table that stores the data
data_table<-remDr$findElement(using = "id","table_1")
# I tried three different methods to click the next button, but the problem persisted.
## next button method 1
next_button<-remDr$findElement(using = "id",'table_1_next')
next_button$clickElement()
## next button method 2
remDr$executeScript("document.getElementById('table_1_next').click()")
## next button method 3
next_button <- remDr$findElement("id", "table_1_next")
next_button$sendKeysToElement(list(key="enter"))
all_data<-list()
cond<-TRUE
while (cond == TRUE) {
  data_table_html <- data_table$getPageSource()
  page <- read_html(data_table_html %>% unlist())
  df <- html_table(page) %>% .[[1]]
  all_data <- rbindlist(list(all_data, df))
  Sys.sleep(5)
  tryCatch(
    {
      next_button <- remDr$findElement("id", "table_1_next")
      next_button$sendKeysToElement(list(key = "enter"))
    },
    error = function(e) {
      print("script complete")
      cond <<- FALSE
    }
  )
  if (cond == FALSE) {
    break
  }
}

Why am I having a problem downloading patent data with "patentsview" in R

I am trying to fetch patent data with the "patentsview" package in R, but I always get an error and couldn't find the solution anywhere. Here's my code:
# Load library
library(patentsview)
# Write query
query <- with_qfuns(
  and(
    begins(cpc_subgroup_id = 'G06N'),
    gte(patent_year = 2020)
  )
)
# Create a list of fields
# get_fields(endpoint = "patents")
# Needed Fields
fields <- c(
  "patent_id",
  "patent_title",
  "patent_abstract",
  "patent_date"
)
# Send an HTTP request to the PatentsView API to get the data
pv_res <- search_pv(query = query, fields = fields, all_pages = TRUE)
The output is:
Error in xheader_er_or_status(resp) : Not Found (HTTP 404).
What am I doing wrong here? And what is the solution?

Webscraping with R a continuous page with "view more"

I'm new to R and need to scrape the titles and dates of the posts on this website: https://www.healthnewsreview.org/news-release-reviews/
Using rvest I was able to write the basic code to get the info:
url <- 'https://www.healthnewsreview.org/?post_type=news-release-review&s='
webpage <- read_html(url)
date_data_html <- html_nodes(webpage,'span.date')
date_data <- html_text(date_data_html)
head(date_data)
webpage <- read_html(url)
title_data_html <- html_nodes(webpage,'h2')
title_data <- html_text(title_data_html)
head(title_data)
But since the website only displays 10 items at first, and you then have to click "view more", I don't know how to scrape the whole site. Thank you!!
Introducing third-party dependencies should be done as a last resort. RSelenium (which r2evans originally posited as the only solution) is not necessary in the vast majority of cases, including this one. (It is necessary for gosh-awful sites that use horrible tech like SharePoint, since maintaining state without a browser context for those is more pain than it's worth.)
If we start with the main page:
library(rvest)
pg <- read_html("https://www.healthnewsreview.org/news-release-reviews/")
We can get the first set of links (10 of them):
pg %>%
  html_nodes("div.item-content") %>%
  html_attr("onclick") %>%
  gsub("^window.location.href='|'$", "", .)
## [1] "https://www.healthnewsreview.org/news-release-review/more-unwarranted-hype-over-the-unique-benefits-of-proton-therapy-this-time-in-combo-with-thermal-therapy/"
## [2] "https://www.healthnewsreview.org/news-release-review/caveats-and-outside-expert-balance-speculative-claim-that-anti-inflammatory-diet-might-benefit-bipolar-disorder-patients/"
## [3] "https://www.healthnewsreview.org/news-release-review/plug-for-study-of-midwifery-for-low-income-women-is-fuzzy-on-benefits-costs/"
## [4] "https://www.healthnewsreview.org/news-release-review/tiny-safety-trial-prematurely-touts-clinical-benefit-of-cancer-vaccine-for-her2-positive-cancers/"
## [5] "https://www.healthnewsreview.org/news-release-review/claim-that-milk-protein-alleviates-chemotherapy-side-effects-based-on-study-of-just-12-people/"
## [6] "https://www.healthnewsreview.org/news-release-review/observational-study-cant-prove-surgery-better-than-more-conservative-prostate-cancer-treatment/"
## [7] "https://www.healthnewsreview.org/news-release-review/recap-of-mental-imagery-for-weight-loss-study-requires-that-readers-fill-in-the-blanks/"
## [8] "https://www.healthnewsreview.org/news-release-review/bmjs-attempt-to-hook-readers-on-benefits-of-golf-slices-way-out-of-bounds/"
## [9] "https://www.healthnewsreview.org/news-release-review/time-to-test-all-infants-gut-microbiomes-or-is-this-a-product-in-search-of-a-condition/"
## [10] "https://www.healthnewsreview.org/news-release-review/zika-vaccine-for-brain-cancer-pr-release-headline-omits-crucial-words-in-mice/"
I guess you want to scrape the content of those ^^ so have at it.
But, there's that pesky "View more" button.
When you click on it, it issues a POST request to the site's admin-ajax.php endpoint (you can see the request in your browser's developer tools).
With curlconverter we can convert that request into a callable httr function, and then wrap that function call in another function with a pagination parameter:
view_more <- function(current_offset = 10) {
  httr::POST(
    url = "https://www.healthnewsreview.org/wp-admin/admin-ajax.php",
    httr::add_headers(
      `X-Requested-With` = "XMLHttpRequest"
    ),
    body = list(
      action = "viewMore",
      current_offset = as.character(as.integer(current_offset)),
      page_id = "22332",
      btn = "btn btn-gray",
      active_filter = "latest"
    ),
    encode = "form"
  ) -> res
  list(
    links = httr::content(res) %>%
      html_nodes("div.item-content") %>%
      html_attr("onclick") %>%
      gsub("^window.location.href='|'$", "", .),
    next_offset = current_offset + 4
  )
}
Now we can run it (the offset defaults to 10, matching the state after the first "View more" click):
x <- view_more()
str(x)
## List of 2
## $ links : chr [1:4] "https://www.healthnewsreview.org/news-release-review/university-pr-misleads-with-claim-that-preliminary-blood-t"| __truncated__ "https://www.healthnewsreview.org/news-release-review/observational-study-on-testosterone-replacement-therapy-fo"| __truncated__ "https://www.healthnewsreview.org/news-release-review/recap-of-lung-cancer-screening-test-relies-on-hyperbole-co"| __truncated__ "https://www.healthnewsreview.org/news-release-review/ties-to-drugmaker-left-out-of-postpartum-depression-drug-study-recap/"
## $ next_offset: num 14
We can pass that new offset to another call:
y <- view_more(x$next_offset)
str(y)
## List of 2
## $ links : chr [1:4] "https://www.healthnewsreview.org/news-release-review/sweeping-claims-based-on-a-single-case-study-of-advanced-c"| __truncated__ "https://www.healthnewsreview.org/news-release-review/false-claims-of-benefit-weaken-news-release-on-experimenta"| __truncated__ "https://www.healthnewsreview.org/news-release-review/contrary-to-claims-heart-scans-dont-save-lives-but-subsequ"| __truncated__ "https://www.healthnewsreview.org/news-release-review/breastfeeding-for-stroke-prevention-kudos-to-heart-associa"| __truncated__
## $ next_offset: num 18
You can do the hard part of scraping the initial article count (it's on the main page) and doing the math to put this in a loop that stops efficiently; a minimal driver loop is sketched below.
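For instance, a sketch that simply stops when a request comes back with no links, rather than pre-computing the article count:
# start from the 10 links scraped off the main page above
all_links <- pg %>%
  html_nodes("div.item-content") %>%
  html_attr("onclick") %>%
  gsub("^window.location.href='|'$", "", .)
offset <- 10
repeat {
  batch <- view_more(offset)
  if (length(batch$links) == 0) break  # no more articles
  all_links <- c(all_links, batch$links)
  offset <- batch$next_offset
  Sys.sleep(1)  # be polite to the server
}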
NOTE: If you are doing this scraping to archive the complete site (whether for them or independently) since it's dying at the end of the year, you should comment to that effect, as I have better suggestions for that use-case than manual coding in any programming language. There are free, industrial "site preservation" frameworks designed to preserve these types of dying resources. If you just need the article content, then an iterator and custom scraper is likely a 👍🏼 choice.
NOTE also that the pagination increment of 4 is what the site does when you literally press the button, so this just mimics that functionality.

rvest: "unknown field names" when attempting to set form

I'm attempting to fill in a web form so that I can scrape the resulting data.
library(rvest)
url <- "https://iemweb.biz.uiowa.edu/pricehistory/pricehistory_SelectContract.cfm?market_ID=214"
pg.form <- html_form(html(url))
which returns
pg.form
[[1]]
<form> '<unnamed>' (POST PriceHistory_GetData.cfm)
<input HIDDEN> 'Market_ID': 214
<select> 'Month' [1/12]
<select> 'Year' [0/2]
<input SUBMIT> '': Get Prices
I then try to set values for the Month and Year fields:
filled_form <- set_values(pg.form,
                          Month = "8",
                          Year = "0")
returns Error: Unknown field names: Month, Year
How do I use rvest to set values in a webform?
From your output, pg.form is actually a list of forms rather than a single form. To access the first form, either do
set_values(pg.form[[1]], Month="8")
or you can do
pg.form <- html_form(html(url))[[1]]
instead.
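Putting the pieces together, a minimal sketch reusing the session-based approach from the first question above:
library(rvest)
url <- "https://iemweb.biz.uiowa.edu/pricehistory/pricehistory_SelectContract.cfm?market_ID=214"
pg.session <- html_session(url)
pg.form <- html_form(html(pg.session))[[1]]
filled_form <- set_values(pg.form, Month = "8", Year = "0")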
lnk3 <- 'http://data.nowgoal.com/history/handicap.htm' # this website's content includes the odds prices
> sess <- html_session(lnk3)
> f0 <- sess %>% html_form
> f1 <- set_values(f0[[2]], matchdate=dateID[1], companyid1=list(c(3,8,4,12,1,23,24,17,31,14,35,22)))
Warning message:
Setting value of hidden field 'companyid1'.
> s <- submit_form(sess, f1)
Submitting with 'NULL'
I tried to submit a form that has a hidden field, but it doesn't seem to work; it submits with 'NULL'.

Loop returning svalue from combobox

This is my first post here but I am a regular stackoverflow visitor.
For uploading new datasets, I am processing a dataframe in which one column has some typing mistakes. I want users to correct the errors via a gcombobox, so that the errors and their correct values are stored and corrected automatically the next time.
# Sample data which includes a wrong countryid
Incorrect_Country = data.frame(id=c(1,2,3), countryid=c("Canadada", "Peruru", "Chinanan"), othercolumn=c("777", "111", "333"))
# Dataframe where some previous pitfalls have been stored
# (it's useful because the model can learn from previous pitfalls)
Country_Recode = data.frame(id=c(1,2,3), Remote.Name=c("Frankekz", "Potuugal", "Mexxico"), Name=c("France", "Portugal", "Mexico"))
# This table presents values for the combobox
Master_Country = data.frame(name=c("Canada", "Peru", "China", "France", "Portugal", "Mexico"))
This is the code (GUI toolkit: gWidgetstcltk):
# Define errors in country
library(sqldf)     # for sqldf()
library(gWidgets)  # GUI functions; backed by the tcltk toolkit
options(guiToolkit = "tcltk")
Rewrite_Country = unique(sqldf("SELECT * FROM Incorrect_Country WHERE countryid NOT IN (SELECT 'Remote.Name' FROM Country_Recode)"))
B <- 0
# Dataframe in which to store the wrong names with their respective corrections
error <- data.frame(x=integer(0), y= character(0), z = character(0))
# Loop for each row with typing errors
for (i in Rewrite_Country["countryid"]) {
  B <- B + 1
  # I create a dialog to prevent several windows popping up at once,
  # which caused the returned value from the combobox to be assigned to the wrong recode name
  gconfirm("New Value not specified. Do u want to change it?", handler = function(h, ...) {
    # I create the window which will include a combobox of correct values
    w <- gwindow("Recode Country for:")
    gp <- ggroup(container = w)
    ## A group for the message and buttons
    i.gp <- ggroup(horizontal = FALSE, container = gp)
    glabel(i, container = i.gp)
    ## Combobox including the correct names
    cb <- gcombobox(Master_Country[["name"]], selected = 0, container = i.gp)
    addHandlerChanged(cb, handler = function(h, ...) {
      # I assign the combobox's svalue to a new global variable
      aNew <- as.character()
      assign("aNew", svalue(cb), envir = as.environment(1))
      print(svalue(cb))
    })
    ## A group to organize the buttons
    button.group <- ggroup(container = i.gp)
    ## Push buttons to the right
    addSpring(button.group)
    # OK button for storing the results: (index, wrong value, correct value)
    button <- gbutton("ok", handler = function(h, ...) {
      error <- rbind(error, c(B, i, aNew))
      # In one of the last tries I set the new environment for the table
      assign("error", error, envir = as.environment(1))
      print(error)
      dispose(w)
    }, container = button.group)
    gbutton("cancel", handler = function(h, ...) dispose(w), container = button.group)
  })
}
I don't get my expected outcome. I find it very hard to retrieve the svalue from the combobox, and impossible to store several results from the variable "aNew" when running the loop. Two other issues also occur:
1 - When I run the code including the loop, it often does not open the widgets (the confirm popup).
2 - The loop exits after disposing the first "Recode Country" window, i.e. after processing "Canadada".
What I really want is for the user to be able to fix the errors in the data.frame Incorrect_Country. The error and its solution are then stored (in the data frame error) so the program knows how to deal with them in future uploads.
How it should work:
1- confirm window (for stopping the loop till the user has corrected the previous error)
2- pop up shows error "canadada"
3- user selects from combobox "canada"
4- Pressing ok will store an integer, the error, and the corrected name in the table error
5- The loop runs again (press confirm and shows "Peruru")
6- Finally I get the error table, such as:
x, y, z
1, Canadada, Canada
2, Peruru, Peru
3, Chinanan, China
Any advice would be appreciated. Thanks
