How can I fix this issue in R with webscraping?

I am trying to pull data from over 800 links and put it into a table. I have tried using the Chrome SelectorGadget but cannot work out how to get it to loop. I must have spent 40 hours and I keep getting error codes. I need to pull the same information from li:nth-child(8), li:nth-child(8) strong and a couple of other text boxes. I have tried following a YouTube video, changing only the names and links while otherwise keeping everything consistent, and it just will not work.
library(tidyverse)
library(rvest)
library(htmltools)
library(xml2)
library(dplyr)
results <- read_html("https://www.artemis.bm/deal-directory/")
issuers <- results %>% html_nodes("#table-deal a") %>% html_text()
url <- results %>% html_nodes("#table-deal a") %>% html_attr("href")
get_modelling = function(url_link) {
  issuer_page = read_html(url_link)
  modelling = issuer_page %>% html_nodes("#info-box li:nth-child(4)") %>%
    html_text()
  return(modelling)
}
issuer_modelling = sapply(url, FUN = get_modelling)
I get these issues:
Warning message:
In for (i in seq_along(specs)) { :
closing unused connection 4 (https://www.artemis.bm/deal-directory/bellemeade-re-2022-1-ltd/)
Called from: open.connection(x, "rb")
Browse[1]> data.table::data.table(placement = unlist(issue_placement))[,.N, placement]
Error during wrapup: object 'issue_placement' not found
Error: no more error handlers available (recursive errors?); invoking 'abort' restart
Browse[1]> c
> data.table::data.table(placement = unlist(issue_placement))[,.N, placement]
Error in unlist(issue_placement) : object 'issue_placement' not found

We can use a simple for loop:
# create an empty vector
df = c()
for (i in head(url)) {
  dd = i %>% read_html() %>% html_nodes("#info-box li:nth-child(4)") %>%
    html_text()
  df = c(dd, df)
}
df
[1] "Risk modelling / calculation agents etc: AIR Worldwide" "Risk modelling / calculation agents etc: AIR Worldwide"
[3] "Risk modelling / calculation agents etc: RMS" "Risk modelling / calculation agents etc: AIR Worldwide"
[5] "Risk modelling / calculation agents etc: AIR Worldwide" "Risk modelling / calculation agents etc: AIR Worldwide"

Related

Choose command order in a function based on an error [R]

I have three files in a folder with the following names:
./multiqc_data$ ls
file1.json
file2.json
file3.json
When I open the files with the TidyMultiqc package, existing NaN values in the files can lead to the following error:
files <- dir(path,pattern = "*.json") #locate files
files %>%
map(~ load_multiqc(file.path(path, .))) #parse them
## the error
Error in parse_con(txt, bigint_as_char) :
lexical error: invalid char in json text.
"mapped_failed_pct": NaN, "paired in
(right here) ------^
I want to create a function to handle this error. Every time the error pops up, I want to be able to apply this sed command to all the files in the folder:
system(paste("gsed -i 's/NaN/null/g'",paste0(path,"*.json")))
Any ideas how I can achieve this?
You could use this wrapper:
safe_load_multiqc <- function(path, file) {
  tryCatch(load_multiqc(file.path(path, file)), error = function(e) {
    system(paste("gsed -i 's/NaN/null/g'", paste0(path, "*.json")))
    # retry after the files have been cleaned
    load_multiqc(file.path(path, file))
  })
}
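One way to run it over the whole folder would then be (a sketch, reusing the path and file pattern from the question):
library(purrr)

files   <- dir(path, pattern = "*.json")
results <- map(files, ~ safe_load_multiqc(path, .))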
A good way to handle errors in work pipelines like this is to use restarts, via withCallingHandlers and withRestarts.
You establish the condition handlers and the recovery protocols (restarts), then you can choose which protocols to use and in which order. Calling handlers allow much finer control over error conditions than a plain try-catch.
In the example below, I wrote two handlers: removeNaN (works at folder level) and skipFile (works at file level); if the first fails, the second is executed (simply skipping the file). Of course, this is just an example.
I think in your case you could simply run sed in every case; nevertheless, I hope this answer meets your request for a canonical approach.
Inspiration and further reading: Beyond Exception Handling: Conditions and Restarts
path <- "../your_path"
# function that does the error_prone task
do_task <- function(path){
files <- dir(path,pattern = "*.json") #locate files
files %>%
map(~ withRestart( # set an alternative restart
load_multiqc(file.path(path, .)), # parsing
skipFile = function() { # if fails, skip only this file
message(paste("skipping ", file.path(path, .)))
return(NULL)
}))
}
# error handler that invokes "removeNaN"
removeNaNHandler <- function(e) tryInvokeRestart("removeNaN")
# error handler that invokes "skipFile"
skipFileHandler <- function(e) tryInvokeRestart("skipFile")
# run the task with handlers in case of error
withCallingHandlers(
condition = removeNaNHandler, # call handler (on generic error)
# condition = skipFileHandler, # if previous fails skips file
{
# run with recovery protocols (can define more than one)
withRestarts({
do_task(path)},
removeNaN = function() # protocol "removeNaN"
{
system(paste("gsed -i 's/NaN/null/g'",paste0(path,"*.json")))
do_task(path) # try again
}
)
}
)
Based on this open github issue, a potential solution provided by Peter Diakumis is to use RJSONIO::fromJSON() in place of jsonlite::read_json(). You could adapt this solution to your use-case by e.g. creating your own load_multiqc() function:
library(RJSONIO)
load_multiqc_bugfix <- function(paths,
                                plots = NULL,
                                find_metadata = function(...) {
                                  list()
                                },
                                plot_parsers = list(),
                                sections = "general") {
  assertthat::assert_that(all(sections %in% c(
    "general", "plot", "raw"
  )), msg = "Only 'general', 'plot' and 'raw' (and combinations of those) are valid items for the sections parameter")

  # Vectorised over paths
  paths %>%
    purrr::map_dfr(function(path) {
      parsed <- RJSONIO::fromJSON(path)
      # The main data is plots/general/raw
      main_data <- sections %>%
        purrr::map(~ switch(.,
          general = parse_general(parsed),
          raw = parse_raw(parsed),
          plot = parse_plots(parsed, plots = plots, plot_parsers = plot_parsers)
        )) %>%
        purrr::reduce(~ purrr::list_merge(.x, !!!.y), .init = list()) %>%
        purrr::imap(~ purrr::list_merge(.x, metadata.sample_id = .y))
      # Metadata is defined by a user function
      metadata <- parse_metadata(parsed = parsed, samples = names(main_data), find_metadata = find_metadata)
      purrr::list_merge(metadata, !!!main_data) %>%
        dplyr::bind_rows()
    }) %>%
    # Only arrange the columns if we have at least 1 column
    `if`(
      # Move the columns into the order: metadata, general, plot, raw
      ncol(.) > 0,
      (.) %>%
        dplyr::relocate(dplyr::starts_with("raw")) %>%
        dplyr::relocate(dplyr::starts_with("plot")) %>%
        dplyr::relocate(dplyr::starts_with("general")) %>%
        dplyr::relocate(dplyr::starts_with("metadata")) %>%
        # Always put the sample ID at the start
        dplyr::relocate(metadata.sample_id),
      .
    )
}
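Calling the adapted function is then the same as calling load_multiqc() itself; a sketch, noting that parse_general(), parse_raw(), parse_plots() and parse_metadata() appear to come from the TidyMultiqc internals and need to be available in scope (e.g. via TidyMultiqc:::parse_general):
files <- dir(path, pattern = "*.json", full.names = TRUE)
df <- load_multiqc_bugfix(files)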

curl error (Could not resolve host: NA) while scraping in a loop

While this code for scraping prices from a webshop has worked perfectly fine for me over the last months, today I just got the following error message:
Error in curl::curl_fetch_memory(url, handle = handle) :
Could not resolve host: NA
The code I use is as follows:
This part is for getting the full URLs:
# Scrape Galaxus
vec_galaxus <- vector()
i <- 0
input_galaxus <- input %>%
  filter(`Galaxus Artikel` != 0)
input_galaxus2 <- paste0('https://www.galaxus.ch/', input_galaxus$`Galaxus Artikel`)
This is the scraping loop:
sess <- session(input_galaxus2[1]) # to start the session
for (j in input_galaxus2){
  sess <- sess %>% session_jump_to(j) # jump to URL
  i <- i + 1
  try(vec_galaxus[i] <- read_html(sess) %>% # can read directly from sess
        html_nodes('.sc-1aeovxo-1.gvrGle') %>%
        html_text() %>%
        str_extract("[0-9]+") %>%
        as.integer())
  Sys.sleep(runif(1, min = 0.2, max = 0.5))
}
where part of my input "input_galaxus2" looks like this:
c("https://www.galaxus.ch/15758734", "https://www.galaxus.ch/7362734",
"https://www.galaxus.ch/12073455", "https://www.galaxus.ch/20841274",
"https://www.galaxus.ch/20589944 ", "https://www.galaxus.ch/13595276",
"https://www.galaxus.ch/16255768", "https://www.galaxus.ch/6296373",
"https://www.galaxus.ch/14513900", "https://www.galaxus.ch/14465626",
"https://www.galaxus.ch/10592707", "https://www.galaxus.ch/19958785",
"https://www.galaxus.ch/9858343", "https://www.galaxus.ch/14513913")
Does anybody know why this code suddenly gives me the above error message?
Thanks in advance for your responses!
If it were a different error, I'd suspect throttling, but this error does not really support that. However, to troubleshoot that (and the possibility of hitting rate limits on the server), try introducing a delay between pulls, perhaps a few seconds or a minute, just to see if that resolves things.
Here's a method that will allow you to repeat your code until all URLs are pulled without error. Note that this may also need the "delay" I suggested above in order not to anger the server admins on the remote end (or a firewall or whatever).
Create a list in which we'll store the results. Run this code only once; all the remaining steps should be repeatable without consequence.
out <- vector("list", length(input_galaxus2))
Prep the session. This may be repeatable depending on if you have authentication or other attributes.
sess <- session(input_galaxus2[1]) #to start the session
Iterate over the empty elements of your URLs and query as needed. If you get any errors, feel free to wait a little bit and re-run this code. If a particular URL succeeded, it will not be re-attempted, so repeat as needed; eventually (assuming the failures are intermittent and all URLs are valid) you will get all results.
I don't think you need read_html in this pipe, but I'm not testing it for fear of "slashdotting" the website. The point of this answer is to suggest a mechanism that lets you reattempt efficiently.
empties <- which(sapply(out, is.null))
for (i in empties) {
  res <- tryCatch({
    sess %>%
      session_jump_to(input_galaxus2[i]) %>%
      html_nodes('.sc-1aeovxo-1.gvrGle') %>%
      html_text() %>%
      str_extract("[0-9]+") %>%
      as.integer()
  }, error = function(e) e)
  if (inherits(res, "error")) {
    warning(sprintf("failed (%i, %s): %s", i, input_galaxus2[i], conditionMessage(res)))
    # optional back-off before the next attempt
    Sys.sleep(3)
  } else out[[i]] <- res
}
Note: this assumes that a NULL value means the previous attempt failed, was interrupted, or ... was not attempted. If NULL can be a valid and successful return value from your pull, then you should likely prefill out with some other "canary" value: choose something that you are more confident will "never" appear in real results, and change how you define empties above.
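A minimal sketch of that canary idea, assuming NA_integer_ can never be a genuine result:
# prefill with a canary value instead of NULL
out <- replicate(length(input_galaxus2), NA_integer_, simplify = FALSE)
# "empty" now means "still holding the canary"
empties <- which(sapply(out, identical, NA_integer_))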
Using purrr::map instead of a loop, without any Sys.sleep():
library(tidyverse)
library(rvest)
df <- tibble(
links = c("https://www.galaxus.ch/15758734", "https://www.galaxus.ch/7362734",
"https://www.galaxus.ch/12073455", "https://www.galaxus.ch/20841274",
"https://www.galaxus.ch/20589944 ", "https://www.galaxus.ch/13595276",
"https://www.galaxus.ch/16255768", "https://www.galaxus.ch/6296373",
"https://www.galaxus.ch/14513900", "https://www.galaxus.ch/14465626",
"https://www.galaxus.ch/10592707", "https://www.galaxus.ch/19958785",
"https://www.galaxus.ch/9858343", "https://www.galaxus.ch/14513913")
)
get_prices <- function(link) {
  link %>%
    read_html() %>%
    html_nodes(".sc-1aeovxo-1.gvrGle") %>%
    html_text2() %>%
    str_remove_all("–")
}
df %>%
  mutate(price = map(links, get_prices) %>%
           as.numeric)
# A tibble: 14 × 2
links price
<chr> <dbl>
1 "https://www.galaxus.ch/15758734" 17.8
2 "https://www.galaxus.ch/7362734" 500.
3 "https://www.galaxus.ch/12073455" 173
4 "https://www.galaxus.ch/20841274" 112
5 "https://www.galaxus.ch/20589944 " 25.4
6 "https://www.galaxus.ch/13595276" 313
7 "https://www.galaxus.ch/16255768" 40
8 "https://www.galaxus.ch/6296373" 62.9
9 "https://www.galaxus.ch/14513900" 539
10 "https://www.galaxus.ch/14465626" 466.
11 "https://www.galaxus.ch/10592707" 63.5
12 "https://www.galaxus.ch/19958785" NA
13 "https://www.galaxus.ch/9858343" 7.3
14 "https://www.galaxus.ch/14513913" 617

How do I use rvest to sort text into different columns?

I am using rvest to (try to) scrape all the author affiliation data from a database of academic publications called RePEc. I have the authors' short IDs, which I'm using to scrape affiliation data. However, each time I try, it gives me the 404 error: Error in open.connection(x, "rb") : HTTP error 404
It must be an issue with my use of sapply because when I test it using an individual ID, it works. Here is the code I'm using:
df$author_reg <- c("paa6","paa2","paa1", "paa8", "pve266", "pya500")
df$websites <- paste0("https://ideas.repec.org/e/", df$author_reg, ".html")
df$affiliation <- sapply(df$websites, function(x) try(x %>% read_html %>% html_nodes("#affiliation h3") %>% html_text()))
I actually need to do this for six columns of authors, and there are NA values I'd like to skip, so if anyone knows how to do that as well I would be enormously grateful (but it's not a big deal if not). Thank you in advance for your help!
EDIT: I have just discovered that the error is in the formula for the websites. Sometimes it should be df$websites <- paste0("https://ideas.repec.org/e/", df$author_reg, ".html") and sometimes it should be df$websites <- paste0("https://ideas.repec.org/f/", df$author_reg, ".html")
Does anyone know how to get R to try both and give me the one that works?
You can build both links and use try on each of them. I am assuming only one of them will give a valid page; otherwise we can always edit the code to take in everything that works:
library(rvest)
library(purrr)
df = data.frame(id = 1:6)
df$author_reg <- c("paa6", "paa2", "paa1", "paa8", "pve266", "pya500")
http1 <- "https://ideas.repec.org/e/"
http2 <- "https://ideas.repec.org/f/"
df$affiliation <- sapply(df$author_reg, function(x){
  links = c(paste0(http1, x, ".html"), paste0(http2, x, ".html"))
  # here we try both links and store the results under attempts
  attempts = links %>% map(function(i){
    try(read_html(i) %>% html_nodes("#affiliation h3") %>% html_text())
  })
  # the good ones will have class "character", the failed ones "try-error"
  gdlink = which(sapply(attempts, class) != "try-error")
  if(length(gdlink) > 0){
    return(attempts[[gdlink[1]]])
  }
  else{
    return("True 404 error")
  }
})
Check the results:
df
id author_reg
1 1 paa6
2 2 paa2
3 3 paa1
4 4 paa8
5 5 pve266
6 6 pya500
affiliation
1 Statistisk SentralbyråGovernment of Norway
2 Department of EconomicsCollege of BusinessUniversity of Wyoming
3 (80%) Institutt for ØkonomiUniversitetet i Bergen, (20%) Gruppe for trygdeøkonomiInstitutt for ØkonomiUniversitetet i Bergen
4 Centraal Planbureau (CPB)Government of the Netherlands
5 Department of FinanceRotterdam School of Management (RSM Erasmus University)Erasmus Universiteit Rotterdam
6 Business SchoolSwinburne University of Technology
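As for the NA author IDs mentioned in the question, one option (a sketch, not part of the original answer; get_affiliation() is a hypothetical helper) is to short-circuit before building the links:
get_affiliation <- function(x) {
  if (is.na(x)) return(NA_character_)   # skip missing author IDs entirely
  links <- paste0(c("https://ideas.repec.org/e/", "https://ideas.repec.org/f/"), x, ".html")
  for (l in links) {
    res <- try(read_html(l) %>% html_nodes("#affiliation h3") %>% html_text(), silent = TRUE)
    if (!inherits(res, "try-error")) return(paste(res, collapse = "; "))
  }
  NA_character_                         # neither variant resolved
}

df$affiliation <- sapply(df$author_reg, get_affiliation)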

simmer: reading resources from inside trajectory functions

I want to be able to modify the resource capacity inside a trajectory as a function of queue length.
The following (simplified) code does not work: when I try to call get_mon_resources(simStore) inside the function, the code crashes with the error:
Error in run_(private$sim_obj, until) :
Expecting a single value: [extent=0].
Thank you for your help.
simStore <- simmer()

fUpdateNumberOfCashiers <- function() {
  dtLastRes <- simStore %>% get_mon_resources %>% tail(1)
  nCapacityNow <- dtLastRes$capacity # same result with get_capacity(simStore)
  nQueueNow <- dtLastRes$queue       # same result with get_queue_count(simStore)
  print(dtLastRes)                   # prints an empty data frame!
  return(5)                          # crashes here! (eventually 5 will be replaced with a more meaningful formula)
}

trajClient <- trajectory("Client's path") %>%
  log_("Arrived to cashier") %>%
  set_capacity("Cashier", value = fUpdateNumberOfCashiers) %>%
  seize("Cashier") %>%
  timeout(function() {rexp(1, 30)}) %>% # one Cashier processes 30 clients / hour
  release("Cashier") %>%
  log_(function(attr) { sprintf("In total spent %.2f", now(simStore) - attr["start_time"]) })

simStore <- simmer("Store") %>%
  add_resource("Cashier", 1) %>%
  add_generator("Store Clients", trajClient, function() {rexp(1, 120)}) %>% # 120 clients / hour
  run(until = nHoursObserved <- 1); simStore
See the discussion related to troubleshooting this problem here: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!topic/simmer-devel/NgIikOpHpss
What causes the problem is that another package (lubridate) masks objects from simmer, as seen below:
Attaching package: ‘lubridate’
The following objects are masked from ‘package:simmer’:
now, rollback
Once I replaced
library(simmer); library(lubridate);
with
library(lubridate); library(simmer);
the problem disappeared!
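Alternatively, if you need both packages attached, calling the masked functions with an explicit namespace makes the load order irrelevant; a minimal sketch (not from the original answer):
library(lubridate)
library(simmer)   # attach order no longer matters

env <- simmer("Store")

trajClient <- trajectory("Client's path") %>%
  seize("Cashier") %>%
  timeout(function() rexp(1, 30)) %>%
  # simmer::now() is always the simulation clock, never lubridate::now()
  log_(function() sprintf("Leaving at %.2f", simmer::now(env))) %>%
  release("Cashier")

env %>%
  add_resource("Cashier", 1) %>%
  add_generator("Client", trajClient, function() rexp(1, 120)) %>%
  run(until = 1)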

Discrete Event simulation in R

From Google, I am trying to understand the "simmer" package in R for discrete event simulation.
As per the example given in the link, I am trying to create the simulator using the create_simulator function, but I am getting the error below.
Code Used:
trajectory <- read.table(header = T, text = "
  event_id description   resource       amount duration       successor
  1        registration  administration 1      runif(1,3,10)  2
  2        intake        nurse          1      runif(1,10,20) 3
  3        consultation  doctor         1      runif(1,5,15)  NA
")
sim <- create_simulator(name = "SuperDuperSim") %>%
  add_trajectory(name = "simple_trajectory", trajectory_df = trajectory) %>%
  add_resource(name = "administration", capacity = 1) %>%
  add_resource(name = "nurse", capacity = 1) %>%
  add_resource(name = "doctor", capacity = 2) %>%
  add_entities_with_interval(n = 10, name_prefix = "patient",
                             trajectory_name = "simple_trajectory",
                             interval = "rnorm(1,10)") %>%
  replicator(15)
Error:
Error in eval(expr, envir, enclos) :
could not find function "create_simulator"
I have never used this package before, but it appears to me that the create_simulator function does not exist in the current version of the package (the add_trajectory function doesn't seem to exist either). Information about the simmer package is available at https://github.com/r-simmer/simmer. More specifically, an introduction is given at https://cran.r-project.org/web/packages/simmer/vignettes/A-introduction.html. It appears from that page that the simulator is now created by calling simmer("nameOfMySimulator").
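For reference, a rough translation of the same model into the current simmer API might look like this (a sketch based on the introduction vignette, not a tested equivalent of the original example):
library(simmer)

patient <- trajectory("simple_trajectory") %>%
  # registration
  seize("administration", 1) %>%
  timeout(function() runif(1, 3, 10)) %>%
  release("administration", 1) %>%
  # intake
  seize("nurse", 1) %>%
  timeout(function() runif(1, 10, 20)) %>%
  release("nurse", 1) %>%
  # consultation
  seize("doctor", 1) %>%
  timeout(function() runif(1, 5, 15)) %>%
  release("doctor", 1)

sim <- simmer("SuperDuperSim") %>%
  add_resource("administration", capacity = 1) %>%
  add_resource("nurse", capacity = 1) %>%
  add_resource("doctor", capacity = 2) %>%
  add_generator("patient", patient, function() rnorm(1, 10)) %>%
  run(until = 200)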
