I have been trying to scrape a website for tables with rvest.
Currently I can get to the table, but it only pulls the table's headers, not its content. I haven't figured out how to completely resolve this, though I've tried a few similar solutions (e.g., How to get table using rvest()).
I have already figured out how to submit a form whose button lacks appropriate unique identifiers, following How to submit login form in Rvest package w/o button argument and https://github.com/hadley/rvest/issues/156.
Here's my operating script:
library(rvest)
library(httr)
library(R.utils)
url<-'https://itmdapps.milwaukee.gov/publicApplication_QD/zipcode.jsp'
##############################
#fix for package available online. https://github.com/hadley/rvest/issues/156
custom.submit_request <- function(form, submit = NULL) {
  submits <- Filter(function(x) {
    identical(tolower(x$type), "submit")
  }, form$fields)
  nsubmits <- Filter(function(x) {
    !identical(tolower(x$type), "submit")
  }, form$fields)
  # if a list, take name and value as inputs
  if (is.list(submit)) {
    submits[[1]]$name <- names(submit)[1]
    submits[[1]]$value <- submit[[1]]
    submit <- submits[[1]]
  }
  # if a character, filter by name
  if (is.character(submit)) {
    submit <- Filter(function(x) {x$name == submit}, submits)[[1]]
  }
  # if NULL, choose the first submit button
  if (is.null(submit)) {
    submit <- submits[[1]]
    message("Submitting with '", submit$name, "'")
  }
  # handle method
  method <- form$method
  if (!(method %in% c("POST", "GET"))) {
    warning("Invalid method (", method, "), defaulting to GET",
            call. = FALSE)
    method <- "GET"
  }
  # url
  url <- form$url
  # fields
  fields <- nsubmits
  fields[submit$name] <- list(submit)
  fields <- Filter(function(x) length(x$value) > 0, fields)
  values <- rvest::pluck(fields, "value")
  names(values) <- names(fields)
  # return
  list(
    method = method,
    encode = form$enctype,
    url = url,
    values = values
  )
}
reassignInPackage('submit_request', 'rvest', custom.submit_request)
#####################################
target_zip_code_position <- 2
webpage.session <- html_session(url) # start the session
form <- html_form(webpage.session)   # here's the form
# log in!
form # let's look at it. We have to log in!
filled_form <- form # create a copy to fill so we don't ruin the original
filled_form[[2]] <- set_values(filled_form[[2]],
                               username = "address",
                               password = "user") # fill the form
filled_form # how does it look?
#filled_form[[2]]$url <- "" # URL needs to be cleared to prevent an error message when submitting
logged_in.session <- submit_form(session = webpage.session,
                                 form = filled_form[[2]]) # defaults to the first submission button, with a message
# We have successfully logged in.
zip_search.session <- jump_to(logged_in.session, url) # navigate to the page with the query we want
zip_search.form <- html_form(zip_search.session)
zip_search.form_filled <- zip_search.form
zip_search.form_filled[[2]] <- set_values(zip_search.form_filled[[2]],
                                          zipcode = target_zip_code_position,
                                          format = 1,
                                          startDate = "01/01/2005",
                                          endDate = "01/01/2006")
list_submit <- list('WIBR Detailed')
names(list_submit) <- c('submit') # very bizarre submit approach
output.session <- submit_form(session = zip_search.session,
                              form = zip_search.form_filled[[2]],
                              submit = list_submit) # requires the fancy submit technique from above; now works
### we have sent a query
#output.read_html <- read_html(output.session)
#output_table <- html_table(output.session, fill = TRUE)[[1]] # no rows
table_node <- html_node(output.session, 'div.main div.content:nth-child(5) table.bordered:nth-child(1)')
html_text(table_node) # still no rows; only selects the proper table
html_table(table_node)
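For debugging, a minimal sketch (assuming the result page really is in output.session, as above): list every table on the page and its parsed dimensions, to check whether the data rows live in a different table node than the one selected.
# Diagnostic sketch: enumerate all tables and their parsed dimensions
all_tables <- html_nodes(output.session, "table")
lapply(all_tables, function(tbl) dim(html_table(tbl, fill = TRUE)))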
I have a vast list of chemicals for which I need to extract the CAS number. I have written a for loop that works as intended. However, when a chemical name is not found on the website, my code obviously stops.
Is there a way to account for this in the for loop, so that when a search query is not found, the loop goes back to the start page and searches for the next item in the list?
Below is my code for the for loop, with a short list of names to search for:
library(RSelenium)
library(netstat)
# start the server
rs_driver_object <- rsDriver(browser = "firefox",
                             verbose = FALSE,
                             port = 4847L) # change number if port is not open
# create a client object
remDrCh <- rs_driver_object$client
items <- c("MCPA", "DEET", "apple")
numbers <- list()
for (i in items) {
  Sys.sleep(2)
  remDrCh$navigate("https://commonchemistry.cas.org/")
  search_box <- remDrCh$findElement(using = 'class', 'search-input')
  search_box$sendKeysToElement(list(paste(i), key = 'enter'))
  Sys.sleep(2)
  result <- remDrCh$findElement(using = "class", "result-content")
  result$clickElement()
  Sys.sleep(2)
  cas <- remDrCh$findElements(using = 'class', 'cas-registry-number')
  cas_n <- lapply(cas, function(x) x$getElementText())
  numbers[[i]] <- unlist(cas_n)
  Sys.sleep(2)
  remDrCh$navigate("https://commonchemistry.cas.org/")
  Sys.sleep(2)
}
The problem lies in the result <- remDrCh$findElement(using = "class", "result-content") part. For "apple" there is no result, and thus no element that R could use.
I tried to write a separate if/else for that specific part, but to no avail. It still only works for queries that yield a result. I also tried to use findElements, but that only helps in the case when no result is found.
result <- remDrCh$findElement(using = "class", "result-content")
if (length(result) > 0) {
  result$clickElement()
} else {
  remDrCh$navigate("https://commonchemistry.cas.org/")
}
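(For clarity: findElement() throws an error when nothing matches, whereas findElements() returns a possibly empty list, so a length() guard only works with the plural form, along these lines:)
results <- remDrCh$findElements(using = "class", "result-content")
if (length(results) > 0) {
  results[[1]]$clickElement() # click the first hit
} else {
  remDrCh$navigate("https://commonchemistry.cas.org/") # no hit: back to the start page
}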
I also tried to use the approach from How to check if an object is visible in a webpage by using its xpath?, but I cannot get it to work on my example.
Any help would be much appreciated!
This should work:
items <- c("MCPA", "apple", "DEET")
numbers <- list()
for (i in items) {
Sys.sleep(2)
remDrCh$navigate("https://commonchemistry.cas.org/")
search_box <- remDrCh$findElement(using = 'class', 'search-input')
search_box$sendKeysToElement(list(paste(i), key = 'enter'))
Sys.sleep(2)
result <- try(remDrCh$findElement(using = "class", "result-content"))
if(!inherits(result, "try-error")){
result$clickElement()
Sys.sleep(2)
cas <- remDrCh$findElements(using = 'class', 'cas-registry-number')
cas_n <- lapply(cas, function (x) x$getElementText())
numbers[[i]] <- unlist(cas_n)
}else{
numbers[[i]] <- NA
}
Sys.sleep(2)
remDrCh$navigate("https://commonchemistry.cas.org/")
Sys.sleep(2)
}
Note the try() wrapper around the problematic code:
result <- try(remDrCh$findElement(using = "class", "result-content"))
This will capture the error if there is one, but allow the loop to continue. Then an if statement clicks the result when the output from try() is not of class "try-error"; otherwise, it records the number as NA.
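As a standalone illustration of the pattern (no Selenium session needed):
x <- try(stop("element not found"), silent = TRUE)
inherits(x, "try-error")
#[1] TRUE  (the loop can branch on this instead of aborting)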
I'm writing a wrapper for the YouTube Analytics API, and have created a function as follows:
yt_request <- function(dimensions = NULL, metrics = NULL, sort = NULL,
                       maxResults = NULL, filtr = NULL, startDate = Sys.Date() - 30,
                       endDate = Sys.Date(), token) {
  url <- paste0("https://youtubeanalytics.googleapis.com/v2/reports?",
                "&ids=channel%3D%3DMINE",
                "&startDate=", startDate,
                "&endDate=", endDate)
  if(!is.null(dimensions)) url <- paste0(url, "&dimensions=", dimensions)
  if(!is.null(metrics)) url <- paste0(url, "&metrics=", metrics)
  if(!is.null(sort)) url <- paste0(url, "&sort=", sort)
  if(!is.null(maxResults)) url <- paste0(url, "&maxResults=", maxResults)
  if(!is.null(filtr)) url <- paste0(url, "&filters=", filtr)
  r <- GET(url, token)
  return(r)
}
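A call then simply composes the query string and fires it. For example (myToken standing in for whatever httr token or config object you authenticated with; it is assumed here, not defined above):
r <- yt_request(metrics = "views", maxResults = 5, token = myToken)
httr::status_code(r) # 200 if the request succeeded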
This is meant to be a flexible, if not the most user-friendly, function, because I want to build wrapper functions around yt_request() that are much friendlier. For example:
top_videos <- function(...) {
  dim <- "video"
  met <- "views,averageViewDuration"
  maxRes <- 10
  temp <- yt_request(dimensions = dim, metrics = met, maxResults = maxRes, token = myToken)
  return(temp)
}
This works fine and dandy so far, but I also want potential users to have a little flexibility with the results. For example, if they want maxResults <- 20 instead of 10, or different metrics from the ones I specify, I want them to be able to pass their own arguments in the ... of top_videos(...).
How can I check whether someone has passed an argument in the ellipsis? If they pass a metric, I want it to override the default I specify; otherwise, go with the default.
EDIT
To help clarify: I'm hoping that when users call the function they can just write something like top_videos(maxResults = 20), and the function will ignore the line maxRes <- 10 and pass maxResults = 20 to yt_request() instead of 10.
We can capture the ... in a list, so its elements become key/value pairs that can be extracted by name. If a particular named element was not passed, extracting it returns NULL. We make use of this behavior of NULL: concatenating it with the default value (10 for maxRes) and selecting the first element ([1]) means that when the element is NULL the default is selected, and otherwise the passed value is. Likewise, do this for all the objects the OP wanted to override.
top_videos <- function(...) {
  lst1 <- list(...)
  dim <- c(lst1[["dimensions"]], "video")[1]
  met <- c(lst1[["metrics"]], "views,averageViewDuration")[1]
  maxRes <- c(lst1[['maxResults']], 10)[1]
  #temp <- yt_request(dimensions = dim,
  #                   metrics = met, maxResults = maxRes, token = myToken)
  #temp
  maxRes
}
Testing:
top_videos(maxResults = 20)
#[1] 20
top_videos(hello = 5)
#[1] 10
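The same NULL-concatenation trick covers the other defaults too; a hypothetical variant that returns met instead of maxRes makes the metrics override visible:
top_metrics <- function(...) {
  lst1 <- list(...)
  c(lst1[["metrics"]], "views,averageViewDuration")[1]
}
top_metrics(metrics = "likes")
#[1] "likes"
top_metrics()
#[1] "views,averageViewDuration"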
I am learning data scraping and, on top of that, I am quite a beginner with R (for work I use Stata; I use R only for very specific tasks).
To learn scraping, I am practicing with a few pages on Psychology Today.
I have written a function that scrapes information for one therapist and creates a data set from the information collected this way:
install.packages('rvest') #Installing the rvest package
install.packages('xml2') #Installing the xml2 package
library('rvest') #to scrape
library('xml2') #to handle missing values (it works with html_node, not with html_nodes)
#Specifying the url for desired website to be scraped
url <- 'https://www.psychologytoday.com/us/therapists/THE_ONE_YOU_WANT'
#Reading the HTML code from the website
URL <- read_html(url)
#creating the function
getProfile <- function(profilescrape) {
  ##NAME
  #Using CSS selectors to get the name
  nam_html <- html_node(URL, '.contact-name')
  #Converting the name data to text
  nam <- html_text(nam_html)
  #Let's have a look
  head(nam)
  #Data pre-processing: removing '\n' (for the other pieces of information, I will keep \n,
  # to help me separate each item within the same type of information)
  nam <- gsub("\n", "", nam)
  head(nam)
  #Converting the info from text to factor
  nam <- as.factor(nam)
  #Let's have a look at the name
  head(nam)
  ##MODALITIES
  #Using CSS selectors to get the modality
  mod_html <- html_node(URL, '.attributes-modality .copy-small')
  #Converting the modality data to text
  mod <- html_text(mod_html)
  head(mod)
  #Converting the info from text to factor
  mod <- as.factor(mod)
  head(mod)
  ##Combining all the lists to form a data frame
  onet_df <- data.frame(Name = nam,
                        Modality = mod)
  ##Structure of the data frame
  str(onet_df)
}
View(onet_df)
This code seems to work well for whatever therapist I choose.
Now, I would like to use this function on multiple profiles to generate one data set with the name and modality of MHPs.
Let's say I want to apply the above function getProfile to the first 20 therapists in Illinois and put the information for these 20 therapists into a data set called onet_df:
j <- 1
MHP_codes <- c(324585 : 449807) #therapist identifier
withinpage_codes <- c(1 : 20) #therapist running number
for(code1 in withinpage_codes) {
  for(code2 in MHP_codes) {
    URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
    record_profile <- getProfile <- function(profilescrape)
    onet_df[[j]] <- rbind.fill(onet_df, record_profile)
    j <- j + 1
  }
}
EDITS START HERE:
This loop does not create any data set; moreover, it does not give any error message.
Would someone be able to help me debug this loop?
Please keep in mind that I am a real beginner.
Following suggestions, I have modified the beginning as follows:
#creating the function
getProfile <- function(URL) {....}
Moreover, I have tried the following alternative loops:
1st alternative
j <- 1
MHP_codes <- c(324585 : 449807) #therapist identifier
withinpage_codes <- c(1 : 20) #therapist running number
for(code1 in withinpage_codes) {
  for(code2 in MHP_codes) {
    URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
    record_profile <- getProfile(URL)
    onet_df[[j]] <- rbind.fill(onet_df, record_profile)
    j <- j + 1
  }
}
which gives the following error message:
Error in UseMethod("xml_find_first") :
no applicable method for 'xml_find_first' applied to an object of class "character"
2nd alternative
MHP_codes <- c(324585, 449807) #therapist identifier
withinpage_codes <- c(1:20) #therapist running number
df_list <- vector(mode = "list",
                  length = length(MHP_codes) * length(withinpage_codes))
j <- 1
for(code1 in withinpage_codes) {
  for(code2 in MHP_codes) {
    URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
    df_list[[j]] <- getProfile(URL)
    j <- j + 1
  }
}
final_df <- rbind.fill(df_list)
This loop gives the same error message (please refer to the one above).
Now I just have to figure out why no data set is produced by the loop. There might be two problems: first, something within the loop does not work (I have run both loops on a single existing page and no data set is produced); second, when I run the loop over a series of links, some of them might be missing, which would produce an error message.
Consider several adjustments:
Adjust the function to receive a URL parameter. Right now, profilescrape is not used anywhere in the function; the function takes whatever URL is assigned in the global environment.
getProfile <- function(URL) {
...
}
Adjust the ending of the function to return the needed object. Without an explicit return, an R function returns the value of its last evaluated expression. Therefore, replace str(onet_df) with return(onet_df).
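For example, the tail of getProfile() then looks like:
  ##Combining all the lists to form a data frame
  onet_df <- data.frame(Name = nam,
                        Modality = mod)
  return(onet_df)
}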
Pass the dynamic URL in the loop to the function, without redefining the function:
URL <- paste0(...)
record_profile <- getProfile(URL)
Initialize a list of specified length (2 x 20) before the loop. Then, on each iteration, assign to the loop index rather than growing the object inside the loop, which is memory-inefficient.
MHP_codes <- c(324585, 449807) #therapist identifier
withinpage_codes <- c(1:20) #therapist running number
df_list <- vector(mode = "list",
                  length = length(MHP_codes) * length(withinpage_codes))
j <- 1
for(code1 in withinpage_codes) {
  for(code2 in MHP_codes) {
    URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
    df_list[[j]] <- tryCatch(getProfile(URL),
                             error = function(e) NULL)
    j <- j + 1
  }
}
Call rbind.fill once outside the loop to combine all the data frames together:
final_df <- rbind.fill(df_list)
With that said, consider an apply-family solution, specifically Map (a wrapper around mapply). Doing so, you avoid the bookkeeping of initializing a list and incrementing a counter, and you "hide" the loop in a compact statement.
# ALL POSSIBLE PAIRINGS
web_codes_df <- expand.grid(MHP_codes = c(324585, 449807),
                            withinpage_codes = c(1:20))
# MOVE URL ASSIGNMENT INSIDE FUNCTION
getProfile <- function(code1, code2) {
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code2, '?sid=5d87f874630bd&ref=', code1, '&rec_next=1&tr=NextProf')
  # ...same code as before...
}
# ELEMENT-WISE LOOP, PASSING PARAMS IN PARALLEL TO FUNCTION
# (code1 is the running number and code2 the therapist ID, matching the URL above)
df_list <- Map(function(code1, code2) tryCatch(getProfile(code1, code2),
                                               error = function(e) NULL),
               code1 = web_codes_df$withinpage_codes,
               code2 = web_codes_df$MHP_codes)
final_df <- rbind.fill(df_list)
One of the users, Parfait, helped me sort out the issues, so a very big thank-you goes to this user.
Below I post the script. I apologize if it is not precisely commented.
Here is the code.
#Loading packages
library('rvest') #to scrape
library('xml2') #to handle missing values (it works with html_node, not with html_nodes)
library('plyr') #to bind together different data sets
#get working directory
getwd()
setwd("~/YOUR OWN FOLDER HERE")
#DEFINE SCRAPING FUNCTION
getProfile <- function(URL) {
  ##NAME
  #Using CSS selectors to get the name
  nam_html <- html_node(URL, '.contact-name')
  #Converting the name data to text
  nam <- html_text(nam_html)
  #Let's have a look
  head(nam)
  #Data pre-processing: removing '\n' (for the other pieces of information, I will keep \n,
  # to help me separate each item within the same type of information)
  nam <- gsub("\n", "", nam)
  head(nam)
  #Converting the info from text to factor
  nam <- as.factor(nam)
  #Let's have a look at the name
  head(nam)
  #If I need to remove blank space, do this:
  #Data pre-processing: removing excess spaces
  #variable <- gsub(" ", "", variable)
  ##MODALITIES
  #Using CSS selectors to get the modality
  mod_html <- html_node(URL, '.attributes-modality .copy-small')
  #Converting the modality data to text
  mod <- html_text(mod_html)
  head(mod)
  #Converting the info from text to factor
  mod <- as.factor(mod)
  head(mod)
  ##Combining all the lists to form a data frame
  onet_df <- data.frame(Name = nam,
                        Modality = mod)
  return(onet_df)
}
Then, I apply this function in a loop over a few therapists. For illustrative purposes, I take four adjacent therapist IDs, without knowing a priori whether each of these IDs has actually been assigned (I do this because I want to see what happens when the loop stumbles on a non-existent link).
j <- 1
MHP_codes <- c(163805:163808) #therapist identifier
df_list <- vector(mode = "list", length(MHP_codes))
for(code1 in MHP_codes) {
  URL <- paste0('https://www.psychologytoday.com/us/therapists/illinois/', code1)
  #Reading the HTML code from the website
  URL <- read_html(URL)
  df_list[[j]] <- tryCatch(getProfile(URL),
                           error = function(e) NULL)
  j <- j + 1
}
final_df <- rbind.fill(df_list)
save(final_df,file="final_df.Rda")
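The saved data set can be restored in a later session with:
load("final_df.Rda") # restores final_df into the workspace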
I am getting stuck trying to incorporate the "pages" element here, along with the "type": the pages element changes, but the type remains constant.
The current code I have maps over the tickers and downloads the data into a list format; however, it only uses the default number of pages. How can I map over the tickers as well as the pages?
Adding company_filings(pages, type) doesn't work.
library(edgarWebR)
tickers <- c("63908", "793952")
count <- 100
pages <- 1:4
type = "10-K"
custom_company_filings <- function(compfilings){
  company_filings(compfilings)
}
x <- Map(custom_company_filings, tickers)
Edit:
What I am trying to achieve is something like:
for(tick in tickers){
  for(i in pages){
    company_filings("get company filings")
  }
}
So for each ticker I will collect pages 1, 2, 3, and 4 from the company_filings() function, which comes from the edgarWebR package.
The custom_company_filings() function was my attempt at solving the problem, but when I posted here I removed all my attempts. For instance, one attempt was:
custom_company_filings <- function(compfilings, pages){
  company_filings(compfilings, pages)
}
x <- Map(custom_company_filings(page = pages), tickers)
We can use a nested lapply:
library(edgarWebR)
tickers <- c("63908", "793952")
count <- 100
pages <- 1:4
type <- "10-K"
lapply(tickers, function(x){
  lapply(pages, function(y){
    company_filings(x, type = type, count = count, page = y)
  })
})
Or with purrr:
library(purrr)
pmap(expand.grid(tickers, pages), ~company_filings(..1, type = type, count = count, page = ..2))
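Either way, each company_filings() call yields one piece. Assuming every piece is a data frame with the same columns, the flat list from the pmap() version can be combined into a single data frame along these lines:
filings <- pmap(expand.grid(tickers, pages),
                ~company_filings(..1, type = type, count = count, page = ..2))
combined <- do.call(rbind, filings)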
I am trying to scrape a web page that requires authentication, using html_session() & html_form() from the rvest package.
I found this example provided by Hadley Wickham, but I am not able to customize it to my case.
united <- html_session("http://www.united.com/")
account <- united %>% follow_link("Account")
login <- account %>%
  html_nodes("form") %>%
  extract2(1) %>%
  html_form() %>%
  set_values(
    `ctl00$ContentInfo$SignIn$onepass$txtField` = "GY797363",
    `ctl00$ContentInfo$SignIn$password$txtPassword` = password)
account <- account %>%
  submit_form(login, "ctl00$ContentInfo$SignInSecure")
In my case, I can't find the values to set in the form, hence I am trying to give the user and password directly:
set_values("email","password")
I also don't know how to refer to the submit button, so I tried:
submit_form(account,login)
The error I got for the submit_form function is:
Error in names(submits)[[1]] : subscript out of bounds
Any idea on how to go about this is appreciated.
Thank you
Currently, this issue is the same as open issue #159 in the rvest package, which causes problems when not all fields in a form have a type value. This bug may be fixed in a future release.
However, we can work around the issue by monkey-patching the underlying function rvest:::submit_request.
The core problem is the helper function is_submit. Initially, it's defined like this:
is_submit <- function(x) tolower(x$type) %in% c("submit", "image", "button")
As logical as this is, however, it fails in two scenarios:
There is no type element.
The type element is NULL.
Both of these happen to occur on the United login form. We can resolve this by adding two checks inside the function.
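A quick illustration of why: when the type element is absent or NULL, tolower() returns a zero-length vector, so the %in% test yields logical(0) rather than FALSE, which breaks the downstream filtering:
x <- list(name = "field_without_type") # no `type` element at all
tolower(x$type) %in% c("submit", "image", "button")
#logical(0)  (zero-length, not FALSE)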
custom.submit_request <- function(form, submit = NULL) {
  is_submit <- function(x) {
    if (!exists("type", x) || is.null(x$type)) {
      return(FALSE)
    }
    tolower(x$type) %in% c("submit", "image", "button")
  }
  submits <- Filter(is_submit, form$fields)
  if (length(submits) == 0) {
    stop("Could not find possible submission target.", call. = FALSE)
  }
  if (is.null(submit)) {
    submit <- names(submits)[[1]]
    message("Submitting with '", submit, "'")
  }
  if (!(submit %in% names(submits))) {
    stop("Unknown submission name '", submit, "'.\n", "Possible values: ",
         paste0(names(submits), collapse = ", "), call. = FALSE)
  }
  other_submits <- setdiff(names(submits), submit)
  method <- form$method
  if (!(method %in% c("POST", "GET"))) {
    warning("Invalid method (", method, "), defaulting to GET",
            call. = FALSE)
    method <- "GET"
  }
  url <- form$url
  fields <- form$fields
  fields <- Filter(function(x) length(x$value) > 0, fields)
  fields <- fields[setdiff(names(fields), other_submits)]
  values <- pluck(fields, "value")
  names(values) <- names(fields)
  list(method = method, encode = form$enctype, url = url, values = values)
}
To monkey patch, we need the R.utils package (install it via install.packages("R.utils") if you don't have it).
library(R.utils)
reassignInPackage("submit_request", "rvest", custom.submit_request)
From there, we can issue our own request:
account <- account %>%
  submit_form(login, "ctl00$ContentInfo$SignInSecure")
And that works!
(Well, "works" is a bit of a misnomer: because United employs more aggressive authentication requirements, including known-browser checks, this results in a 401 Unauthorized. However, it fixes the error.)
A full reproducible example involves a couple of other minor code changes:
library(magrittr)
library(rvest)
url <- "https://www.united.com/web/en-US/apps/account/account.aspx"
account <- html_session(url)
login <- account %>%
  html_nodes("form") %>%
  extract2(1) %>%
  html_form() %>%
  set_values(
    `ctl00$ContentInfo$SignIn$onepass$txtField` = "USER",
    `ctl00$ContentInfo$SignIn$password$txtPassword` = "PASS")
account <- account %>%
  submit_form(login, "ctl00$ContentInfo$SignInSecure")
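To inspect what came back, note that pre-1.0 rvest session objects keep the last httr response, so a minimal check is:
httr::status_code(account$response)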