I'm attempting to fill in a web form so that I can scrape the resulting data.
library(rvest)
url <- "https://iemweb.biz.uiowa.edu/pricehistory/pricehistory_SelectContract.cfm?market_ID=214"
pg.form <- html_form(html(url))
which returns
pg.form
[[1]]
<form> '<unnamed>' (POST PriceHistory_GetData.cfm)
<input HIDDEN> 'Market_ID': 214
<select> 'Month' [1/12]
<select> 'Year' [0/2]
<input SUBMIT> '': Get Prices
My instinct was to set values for the Month and Year fields, but this turns out to be a mistake:
filled_form <- set_values(pg.form,
Month = "8",
Year = "0")
returns Error: Unknown field names: Month, Year
How do I use rvest to set values in a webform?
From your output, pg.form is actually a list of forms rather than a single form. To access the first form, either do
set_values(pg.form[[1]], Month="8")
or you can do
pg.form <- html_form(html(url))[[1]]
instead.
lnk3 <- 'http://data.nowgoal.com/history/handicap.htm' # this page's content includes the odds prices
> sess <- html_session(lnk3)
> f0 <- sess %>% html_form
> f1 <- set_values(f0[[2]], matchdate=dateID[1], companyid1=list(c(3,8,4,12,1,23,24,17,31,14,35,22)))
Warning message:
Setting value of hidden field 'companyid1'.
> s <- submit_form(sess, f1)
Submitting with 'NULL'
I tried to submit a form whose fields are hidden, but it doesn't seem to work; it just reports Submitting with 'NULL'.
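"Submitting with 'NULL'" suggests submit_form() found no submit button in that form. A possible workaround (an untested sketch against the old rvest form API, not from this thread; the field name "submit" is made up): append a dummy submit field so submit_form() has something to click.
# Hypothetical workaround: fake a submit button on the parsed form
f1$fields[["submit"]] <- structure(
  list(name = "submit", type = "submit", value = "submit"),
  class = "input")
s <- submit_form(sess, f1)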
I'm having an issue when using rvest to scrape 466 pages from a wiki. Each page represents a metric that I need further information about. I have the following code which loops through each link (loaded from a csv file) and extracts the information I need from a html table on each page.
library(rvest)  # for read_html() and html_table()
Metrics <- read.csv("C:\\Users\\me\\Documents\\WebScraping\\LONMetrics.csv")
Metrics$Theme <- as.character(paste0(Metrics$Theme))
Metrics$Metric <- as.character(paste0(Metrics$Metric))
Metrics$URL <- as.character(paste0(Metrics$URL))
n = nrow(Metrics)
i = 1
while (i <= n) {
webPage <- read_html(Metrics$URL[i])
pageTable <- html_table(webPage)
Metrics$Definition[i] <- pageTable[[1]]$X2[1]
Metrics$Category[i] <- pageTable[[1]]$X2[2]
Metrics$Calculation[i] <- pageTable[[1]]$X2[3]
Metrics$UOM[i] <- pageTable[[1]]$X2[4]
Metrics$ExpectedTrend[i] <- pageTable[[1]]$X2[6]
Metrics$MinTech[i] <- pageTable[[1]]$X2[7]
i = i+1
}
The problem I'm having is that it stops returning data after 32 pages giving an error as:
Error in read_connection_(x, n) :
Evaluation error: Failure when receiving data from the peer
I'm wondering what the cause may be and how to get around this seeming limitation?
Thanks.
Rob
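"Failure when receiving data from the peer" is a curl-level error: the server or network dropped the connection mid-transfer, rather than rvest hitting a hard page limit. A sketch of one way around it (the helper name read_html_retry is my own, not from the thread): pause between requests and retry a failed read a few times before giving up.
read_html_retry <- function(url, tries = 3, pause = 2) {
  for (k in seq_len(tries)) {
    page <- tryCatch(read_html(url), error = function(e) NULL)
    if (!is.null(page)) return(page)
    Sys.sleep(pause)  # give the server a moment before retrying
  }
  stop("Failed to read ", url, " after ", tries, " tries")
}
# Inside the loop, replace the plain read with:
webPage <- read_html_retry(Metrics$URL[i])
Adding a short Sys.sleep() between iterations is also gentler on the server.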
I'm using the R package rnoaa (along with its required packages) to gather historical weather data. I wrote this nested loop to gather all the data sets, but I keep getting errors when I run it. It seems to run fine for a moment, then fails.
The loop:
require('triebeard')
require('bindr')
require('colorspace')
require('mime')
require('curl')
require('openssl')
require('R6')
require('urltools')
require('httpcode')
require('stringr')
require('assertthat')
require('bindrcpp')
require('glue')
require('magrittr')
require('pkgconfig')
require('rlang')
require('Rcpp')
require('BH')
require('plogr')
require('purrr')
require('stringi')
require('tidyselect')
require('digest')
require('gtable')
require('plyr')
require('reshape2')
require('lazyeval')
require('RColorBrewer')
require('dichromat')
require('munsell')
require('labeling')
require('viridisLite')
require('data.table')
require('rjson')
require('httr')
require('crul')
require('lubridate')
require('dplyr')
require('tidyr')
require('ggplot2')
require('scales')
require('XML')
require('xml2')
require('jsonlite')
require('rappdirs')
require('gridExtra')
require('tibble')
require('isdparser')
require('geonames')
require('hoardr')
require('rnoaa')
install.packages('ncdf4')
install.packages("devtools")
library(devtools)
install_github("ropensci/rnoaa")
library(rnoaa)
list <- buoys(dataset='wlevel')
lid <- data.frame(list$id)
foo <- for(range in 1990:2017){
for(bid in lid){
bid_range <- buoy(dataset = 'wlevel', buoyid = bid, year = range)
bid.year.data <- data.frame(bid_range$data)
write.csv(bid.year.data, file='cwind/bid_range.csv')
}
}
The response:
Using c1990.nc
Using
Error: length(url) == 1 is not TRUE
It saves the first data set, but the loop variables are never substituted into the file name; the file is just named bid_range.csv.
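The literal file name is a separate problem: 'cwind/bid_range.csv' is a plain string, so the loop variables never reach it. A minimal sketch of the intended call (using the same variables as the loop above):
write.csv(bid.year.data, file = paste0("cwind/", bid, "_", range, ".csv"))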
This error message shows that there is no data for the given station id in 1990. Because you were using a for loop, it stops as soon as it hits an error.
Here I introduce the use of the tidyverse to download the NOAA buoy data. A lot of the following functions are from the purrr package, which is part of the tidyverse.
# Load packages
library(tidyverse)
library(rnoaa)
Step 1: Create a "Grid" containing all combinations of id and year
The expand function from tidyr can create all the combinations of different values.
data_list <- buoys(dataset = 'wlevel')
data_list2 <- data_list %>%
select(id) %>%
expand(id, year = 1990:2017)
Step 2: Create a "safe" version that does not break when there is no data.
Also make this function suitable for map2.
We will use map2 to loop through all the combinations of id and year via its .x and .y arguments, so we reorder the arguments of buoy to create buoy_modify. We also use the safely function to create a safe version of buoy_modify: when it hits an error, it stores the error message and moves on to the next item rather than stopping.
# Modify the buoy function
buoy_modify <- function(buoyid, year, dataset, ...){
buoy(dataset, buoyid = buoyid, year = year, ...)
}
# Create a safe version of buoy_modify
buoy_safe <- safely(buoy_modify)
Step 3: Apply the buoy_safe function
wlevel_data <- map2(data_list2$id, data_list2$year, buoy_safe, dataset = "wlevel")
# Assign name for the element in the list based on id and year
names(wlevel_data) <- paste(data_list2$id, data_list2$year, sep = "_")
After this step, all the data are downloaded into wlevel_data. Each element in wlevel_data has two parts: $result holds the data if the download was successful, otherwise it is NULL; $error is NULL if the download was successful, otherwise it holds the error message.
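For example (the station id here is hypothetical, just to show the shape):
wlevel_data[["46012_1990"]]$result  # the buoy data if the download succeeded, else NULL
wlevel_data[["46012_1990"]]$error   # NULL on success, else the captured error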
Step 4: Access the data
transpose can turn a list "inside out". So now wlevel_data2 has two elements, result and error, and we can pull these apart to access the data.
# Turn the list "inside out"
wlevel_data2 <- transpose(wlevel_data)
# Get the error message
wlevel_error <- wlevel_data2$error
# Get the result
wlevel_result <- wlevel_data2$result
# Remove NULL element in wlevel_result
wlevel_result2 <- wlevel_result[!map_lgl(wlevel_result, is.null)]
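From here the successful downloads can be stacked into a single data frame; a sketch, assuming each $result is a buoy object whose $data slot is a data frame (as the bid.year$data usage in the question suggests):
# Bind all successful results together, keeping each "id_year" name as a column
wlevel_all <- map_dfr(wlevel_result2, ~ as.data.frame(.x$data), .id = "id_year")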
I was trying to convert an XML file into an R data frame using the XML package. I was able to get a data frame successfully, but whenever there were grandchildren under a child, the values of the grandchildren were merged into one column.
Here is what the XML looks like:
<user>
<created-at type="datetime">2012-12-20T18:32:20+00:00</created-at>
<details></details>
<is-active type="boolean">true</is-active>
<last-login type="datetime">2017-06-22T16:52:11+01:00</last-login>
<time-zone>Pacific Time (US & Canada)</time-zone>
<updated-at type="datetime">2017-06-22T21:00:47+01:00</updated-at>
<is-verified type="boolean">true</is-verified>
<groups type="array">
<group>
<created-at type="datetime">2015-02-09T09:34:41+00:00</created-at>
<id type="integer">23215935</id>
<is-active type="boolean">true</is-active>
<name>Product Managers</name>
<updated-at type="datetime">2015-02-09T09:34:41+00:00</updated-at>
</group>
</groups>
</user>
The code I used was:
users_xml = xmlTreeParse("users.xml")
top_users = xmlRoot(users_xml)
users = xmlSApply(top_users, function(x) xmlSApply(x, xmlValue))
The result I got had all the elements listed fine, except that it combined everything under "groups" into one column. Is there any way I can make each element under "group" a separate column in the final data frame?
I also tried
nodes=getNodeSet(top_users, "//groups[#group]")
and
nodes=getNodeSet(top_users, "//groups/group[#group]")
and
nodes=getNodeSet(top_users, "//.groups/group[#group]")
and switched "top_users" to "users_xml", but each time I got the error message:
Error: 1: Input is not proper UTF-8, indicate encoding !
Bytes: 0xC2 0x3C 0x2F 0x6E
Then I tried
data.frame(t(xpathSApply(xmlRoot(xmlTreeParse("users.xml", useInternalNodes = T)),
"//user", function(y) xmlSApply(y, xmlValue))))
This gave me exactly the same thing as the first solution.
And finally, I tried
data.frame(t(xpathSApply(xmlRoot(xmlTreeParse("users.xml", useInternalNodes = T)),
"//user/groups/group", function(y) xmlSApply(y, xmlValue))))
This did give me a data frame, but only with the elements in "group", and there is no way to map it back to the first table, which has all the elements in "user".
Consider column binding with xmlToDataFrame() of the user children and the groups children. Specifying the document's encoding when parsing may also address the "Input is not proper UTF-8" error above:
doc <- xmlParse("users.xml", encoding = "UTF-8")
userdf <- xmlToDataFrame(nodes=getNodeSet(doc, "/user"))
groupdf <- xmlToDataFrame(nodes=getNodeSet(doc, "/user/groups/group"))
df <- transform(cbind(userdf, groupdf), groups = NULL) # REMOVE groups COL
df
# created.at details is.active last.login time.zone
# 1 2012-12-20T18:32:20+00:00 true 2017-06-22T16:52:11+01:00 Pacific Time (US & Canada)
# updated.at is.verified created.at.1 id is.active.1 name
# 1 2017-06-22T21:00:47+01:00 true 2015-02-09T09:34:41+00:00 23215935 true Product Managers
# updated.at.1
# 1 2015-02-09T09:34:41+00:00
I want to get Google Analytics data for a specific list of card numbers. The custom dimension ga:dimension10 contains the card numbers. The following code works:
ga_datasubset <- subset(get_ga(id, Startdatum, Einddatum,
metrics = c("ga:sessions", " ga:pageviews","ga:sessionDuration"),
dimensions="ga:dimension10, ga:deviceCategory, ga:medium",
fetch.by ="day"),
dimension10 %in% Datatest[,1])
But I want to write this code without using the subset function. I tried the code below, but it doesn't work.
ga_datasubset <- get_ga(id, Startdatum, Einddatum,
metrics = c("ga:sessions", " ga:pageviews","ga:sessionDuration"),
dimensions="ga:dimension10, ga:deviceCategory, ga:medium",
filters ="ga:dimension10 %in% Datatest[,1]" ,
fetch.by ="day")
Error: Invalid parameter: Invalid value 'ga:dimension10 %in% Datatest[,1]' for filters parameter.
Any help will be greatly appreciated
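The filters argument must be a Google Analytics filter expression, not R code, so the %in% has to be translated into GA syntax, where a comma between clauses means OR. A sketch (assuming Datatest[, 1] holds the card numbers, as in the subset version above):
# Build "ga:dimension10==card1,ga:dimension10==card2,..." (comma = OR in GA)
card_filter <- paste0("ga:dimension10==", Datatest[, 1], collapse = ",")
ga_datasubset <- get_ga(id, Startdatum, Einddatum,
                        metrics = c("ga:sessions", "ga:pageviews", "ga:sessionDuration"),
                        dimensions = "ga:dimension10,ga:deviceCategory,ga:medium",
                        filters = card_filter,
                        fetch.by = "day")
A very long list of card numbers may exceed the API's limit on the length of the filters parameter, in which case batching the query or subsetting client-side may still be needed.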
Adapting this SO answer, I'm trying to use rvest to fill in a form and scrape the resulting page. I keep running into an error.
library(rvest)
url <- "https://iemweb.biz.uiowa.edu/pricehistory/pricehistory_SelectContract.cfm?market_ID=214"
pg.session <- html_session(url)
pg.form <- html_form(html(pg.session))
filled_form <- set_values(pg.form[[1]],
Month = "8",
Year = "1")
out <- submit_form(session = pg.session, pg.form)
returns this error
Submitting with ''
Error in if (!(submit %in% names(submits))) { :
argument is of length zero
What am I doing wrong?
Well, for one thing, you are not submitting the form you actually filled in, and you are also attempting to pass in a list of forms rather than a single form. But it also appears there may be a bug in the code that doesn't recognize submit buttons with upper-case type attributes. In this case, the HTML contains
<INPUT TYPE="SUBMIT" VALUE="Get Prices">
and submit_form calls submit_request, which looks for submit buttons via
submits <- Filter(function(x) identical(x$type, "submit"),
form$fields)
and since it checks for values identical to "submit", it doesn't find "SUBMIT":
sapply(pg.form[[1]]$fields, function(x) x$type)
# $Market_ID
# [1] "HIDDEN"
# $Month
# NULL
# $Year
# NULL
# $`NULL`
# [1] "SUBMIT"
The easiest thing might be to change it ourselves:
filled_form <- set_values(pg.form[[1]],
Month = "08",
Year = "2007")
filled_form$fields[[4]]$type <- "submit"
The other problem is that this version has a bug in the way the URL for the form is resolved. We can fix it with
# incorrectly was: url <- XML::getRelativeURL(session$url, form$url)
body(submit_form)[[3]]<-quote(url <- XML::getRelativeURL(form$url, session$url))
And now finally we can submit the request
out <- submit_form(session = pg.session, filled_form)
# out %>% html_table()
(Tested with rvest_0.2.0.9000)