simmer get_attribute | "there is no arrival running" error - r

First of all, the simmer vignette and the advanced simmer usage page seem to indicate that the error stems from the fact that "get_name, get_attribute and get_prioritization are meant to be used inside a trajectory; otherwise, there will be no arrival running and these functions will throw an error".
A minimal working example:
patient_traj <- trajectory(name = "patient_trajectory") %>%
  set_attribute("my_key", 123) %>%
  timeout(5) %>%
  set_attribute("my_key", function() get_attribute(env, "my_key") + 1) %>%
  timeout(5) %>%
  set_attribute("dependent_key", function() ifelse(get_attribute(env, "my_key") <= 123, 1, 0)) %>%
  timeout(5) %>%
  set_attribute("independent_key", function() runif(1))

env <- simmer() %>%
  add_generator("patient", patient_traj, at(0), mon = 2)

env %>% run()
#> simmer environment: anonymous | now: 15 | next:
#> { Generator: patient | monitored: 2 | n_generated: 1 }
get_mon_attributes(env)
#> time name key value replication
#> 1 0 patient0 my_key 123.0000000 1
#> 2 5 patient0 my_key 124.0000000 1
#> 3 10 patient0 dependent_key 0.0000000 1
#> 4 15 patient0 independent_key 0.9234335 1
Now this works as it's supposed to; the problem starts when I try to call get_attribute() in any other way. Adding this line after the last set_attribute() at the very end of the trajectory definition:
log_(get_attribute(env, "independent_key"))
throws the above-mentioned error.
What I actually want to do is call the leave() function and pass it an attribute as the probability. I still do this inside the trajectory:
leave(prob = get_attribute(env, "independent_key"))
Needless to say, this also throws the error "Error in get_attribute_(private$sim_obj, key, global) : there is no arrival running".
Does anyone know what might cause this? I feel like the only explanation is the one above, "get_attribute is meant to be used inside a trajectory" - but I feel like I am already doing that.
Thanks already!

Okay, I am embarrassed to say this, but the problem was rather easily fixed. It seems the problem was accessing the attribute directly.
So log_(get_attribute(env, "independent_key")) does not work, but log_(function() get_attribute(env, "independent_key")) does.
That's all it takes.
If anyone has an explanation as to why that is all it takes, I would highly appreciate it.
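The explanation seems to be lazy versus immediate evaluation: a bare get_attribute(env, ...) is evaluated once, while the trajectory is being built and no arrival exists yet, whereas wrapping it in function() defers the call until an arrival is actually running through that activity. Under that assumption, the leave() step from the question should work the same way; a minimal sketch:

library(simmer)

patient_traj <- trajectory(name = "patient_trajectory") %>%
  set_attribute("independent_key", function() runif(1)) %>%
  # wrap the lookup in a function so it is evaluated at run time,
  # when an arrival is actually traversing the trajectory
  leave(prob = function() get_attribute(env, "independent_key")) %>%
  log_("arrival stayed in the trajectory")

env <- simmer() %>%
  add_generator("patient", patient_traj, at(0), mon = 2)

env %>% run()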

Related

curl error (Could not resolve host: NA) while scraping in a loop

While this code for scraping prices from a webshop has worked perfectly fine for me over the last months, today I just got the following error message:
Error in curl::curl_fetch_memory(url, handle = handle) :
Could not resolve host: NA
The code I use is as follows.
This part builds the full URLs:
# Scrape Galaxus
vec_galaxus <- vector()
i <- 0
input_galaxus <- input %>%
  filter(`Galaxus Artikel` != 0)
input_galaxus2 <- paste0('https://www.galaxus.ch/', input_galaxus$`Galaxus Artikel`)
This is the scraping loop:
sess <- session(input_galaxus2[1]) # to start the session
for (j in input_galaxus2) {
  sess <- sess %>% session_jump_to(j) # jump to URL
  i <- i + 1
  try(vec_galaxus[i] <- read_html(sess) %>% # can read directly from sess
        html_nodes('.sc-1aeovxo-1.gvrGle') %>%
        html_text() %>%
        str_extract("[0-9]+") %>%
        as.integer())
  Sys.sleep(runif(1, min = 0.2, max = 0.5))
}
where part of my input "input_galaxus2" looks like this:
c("https://www.galaxus.ch/15758734", "https://www.galaxus.ch/7362734",
"https://www.galaxus.ch/12073455", "https://www.galaxus.ch/20841274",
"https://www.galaxus.ch/20589944 ", "https://www.galaxus.ch/13595276",
"https://www.galaxus.ch/16255768", "https://www.galaxus.ch/6296373",
"https://www.galaxus.ch/14513900", "https://www.galaxus.ch/14465626",
"https://www.galaxus.ch/10592707", "https://www.galaxus.ch/19958785",
"https://www.galaxus.ch/9858343", "https://www.galaxus.ch/14513913")
Does anybody know why suddenly this code gives me the above error message?
Thanks in advance for your responses!
If it were a different error, I'd think it was throttling, but this error does not really support that. Still, to troubleshoot that (and the possibility that you're hitting too-many-requests limits on the server), try introducing a delay between pulls, perhaps a few seconds or a minute, just to see if that resolves things.
Here's a method that will allow to you repeat your code until all URLs are pulled without error. Note that this may also need the "delay" I suggested above in order to not anger the server admins on the remote end (or firewall or whatever).
Create a list in which we'll store the results. Run this code only once; all the remaining steps should be repeatable without consequence.
out <- vector("list", length(input_galaxus2))
Prep the session. This may be repeatable depending on whether you have authentication or other attributes.
sess <- session(input_galaxus2[1]) #to start the session
Iterate over the empty elements of your URLs and query as needed. If you get any errors, feel free to wait a little bit and re-run this code. If a particular URL succeeded, it will not be re-attempted, so repeat as needed; eventually (assuming the failures are intermittent and all URLs are valid) you will get all results.
I don't think you need read_html in this pipe, but I'm not testing for fear of "slashdotting" the website. The point of this answer is to suggest a mechanism that allows you to reattempt efficiently.
empties <- which(sapply(out, is.null))
for (i in empties) {
  res <- tryCatch({
    sess %>%
      session_jump_to(input_galaxus2[i]) %>%
      html_nodes('.sc-1aeovxo-1.gvrGle') %>%
      html_text() %>%
      str_extract("[0-9]+") %>%
      as.integer()
  }, error = function(e) e)
  if (inherits(res, "error")) {
    warning(sprintf("failed (%i, %s): %s", i, input_galaxus2[i], conditionMessage(res)))
    # optional: back off a little before the next attempt
    Sys.sleep(3)
  } else out[[i]] <- res
}
Note: this assumes that a NULL value means the previous attempt failed, was interrupted, or ... was not attempted. If NULL can be a valid and successful return value from your pull, then you should likely prefill out with some other "canary" value: choose something that you are more confident will "never" appear in real results, and change how you define empties above.
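For instance, a minimal sketch of that canary approach, replacing the NULL prefill above (the sentinel string "UNATTEMPTED" is just an illustrative choice, not anything special to rvest):

canary <- "UNATTEMPTED"   # hypothetical sentinel that should never appear in real results
out <- as.list(rep(canary, length(input_galaxus2)))
# redefine "empties" as the slots still holding the canary
empties <- which(sapply(out, identical, canary))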
Using purrr::map instead of a loop, without any Sys.sleep():
library(tidyverse)
library(rvest)
df <- tibble(
links = c("https://www.galaxus.ch/15758734", "https://www.galaxus.ch/7362734",
"https://www.galaxus.ch/12073455", "https://www.galaxus.ch/20841274",
"https://www.galaxus.ch/20589944 ", "https://www.galaxus.ch/13595276",
"https://www.galaxus.ch/16255768", "https://www.galaxus.ch/6296373",
"https://www.galaxus.ch/14513900", "https://www.galaxus.ch/14465626",
"https://www.galaxus.ch/10592707", "https://www.galaxus.ch/19958785",
"https://www.galaxus.ch/9858343", "https://www.galaxus.ch/14513913")
)
get_prices <- function(link) {
  link %>%
    read_html() %>%
    html_nodes(".sc-1aeovxo-1.gvrGle") %>%
    html_text2() %>%
    str_remove_all("–")
}

df %>%
  mutate(price = map(links, get_prices) %>%
           as.numeric)
# A tibble: 14 × 2
links price
<chr> <dbl>
1 "https://www.galaxus.ch/15758734" 17.8
2 "https://www.galaxus.ch/7362734" 500.
3 "https://www.galaxus.ch/12073455" 173
4 "https://www.galaxus.ch/20841274" 112
5 "https://www.galaxus.ch/20589944 " 25.4
6 "https://www.galaxus.ch/13595276" 313
7 "https://www.galaxus.ch/16255768" 40
8 "https://www.galaxus.ch/6296373" 62.9
9 "https://www.galaxus.ch/14513900" 539
10 "https://www.galaxus.ch/14465626" 466.
11 "https://www.galaxus.ch/10592707" 63.5
12 "https://www.galaxus.ch/19958785" NA
13 "https://www.galaxus.ch/9858343" 7.3
14 "https://www.galaxus.ch/14513913" 617

r - check if columns were renamed when unnesting

Is there a way to know if columns were renamed in the process of unnesting? I want to know if there is something that lets me intercept any messages that come through with "New names:" and give more context about solutions.
# min reprex
library(tidyverse)
f <- function() {
  tibble(
    x = 1:2,
    y = 2:1,
    z = tibble(x = 1)
  ) |>
    unnest_wider(z, names_repair = "unique")
}
f()
New names:
• `x` -> `x...1`
• `x` -> `x...3`
x...1 y x...3
----- ----- -----
1 2 1
2 1 1
More context:
The message stems from
vctrs::vec_as_names(c("x", "x"), repair = "universal")
I see information about withCallingHandlers(), but I'm not sure if that is the right route. I thought there was a way for errors/messages to have classes that you can intercept, but I can't remember what I read.
Something in testthat::expect_message() may help. I thought there would be a has_message() function out there.
There is a lot of tidy evaluation involved, and comparing names before and after might be tricky. I could look for the repaired names with the regex "\\.+\\d+$", but I'm not sure that is robust enough, since the data could have fields with that syntax already.
Thank you!
Taking inspiration from Hadley's answer on the question that ritchie-sacramento linked, you should check out the evaluate package.
> eval_res <- evaluate::evaluate("f()")
> eval_res[[2]]$message
[1] "New names:\n* x -> x...1\n* x -> x...3\n"
This will require more testing to see what happens to the data structure when there are errors, warnings, or even multiple messages. But this seems like the right track.
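Alternatively, withCallingHandlers() can intercept the message as it is signalled; a minimal sketch that keys off the "New names:" text (inspecting class(m) inside the handler would show whether a more specific condition class is available to match on instead):

renamed <- FALSE
res <- withCallingHandlers(
  f(),
  message = function(m) {
    if (grepl("^New names:", conditionMessage(m))) {
      renamed <<- TRUE               # record that a rename happened
      invokeRestart("muffleMessage") # optionally suppress the original message
    }
  }
)
renamed  # presumably TRUE whenever unnest_wider() had to repair names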

Getting different results from the same function inside and outside a mutate function call

Can someone explain to me why I get a different result when I run the convertToDisplayTime function inside mutate than when I run it on its own? The correct result is the one I obtain when I run it on its own. Also, why do I get these warnings? It feels like I might be passing the whole timeInSeconds column as an argument when I call convertToDisplayTime in the mutate function, but I'm not sure that I really understand the mechanics in play here.
library('tidyverse')
#> Warning: package 'tibble' was built under R version 4.1.2
convertToDisplayTime <- function(timeInSeconds) {
  ## Takes a time in seconds and converts it
  ## to a xx:xx:xx string format
  if (timeInSeconds > 86400) { # Not handling time over a day
    stop(simpleError("Enter a time below 86400 seconds (1 day)"))
  } else if (timeInSeconds > 3600) {
    numberOfMinutes = 0
    numberOfHours = timeInSeconds %/% 3600
    remainingSeconds = timeInSeconds %% 3600
    if (remainingSeconds > 60) {
      numberOfMinutes = remainingSeconds %/% 60
      remainingSeconds = remainingSeconds %% 60
    }
    if (numberOfMinutes < 10) {
      displayMinutes = paste0("0", numberOfMinutes)
    } else {
      displayMinutes = numberOfMinutes
    }
    remainingSeconds = round(remainingSeconds)
    if (remainingSeconds < 10) {
      displaySeconds = paste0("0", remainingSeconds)
    } else {
      displaySeconds = remainingSeconds
    }
    return(paste0(numberOfHours, ":", displayMinutes, ":", displaySeconds))
  } else if (timeInSeconds > 60) {
    numberOfMinutes = timeInSeconds %/% 60
    remainingSeconds = timeInSeconds %% 60
    remainingSeconds = round(remainingSeconds)
    if (remainingSeconds < 10) {
      displaySeconds = paste0("0", remainingSeconds)
    } else {
      displaySeconds = remainingSeconds
    }
    return(paste0(numberOfMinutes, ":", displaySeconds))
  } else {
    return(paste0("0:", timeInSeconds))
  }
}
(df <- tibble(timeInSeconds = c(2710.46, 2705.04, 2691.66, 2708.10)) %>% mutate(displayTime = convertToDisplayTime(timeInSeconds)))
#> Warning in if (timeInSeconds > 86400) {: the condition has length > 1 and only
#> the first element will be used
#> Warning in if (timeInSeconds > 3600) {: the condition has length > 1 and only
#> the first element will be used
#> Warning in if (timeInSeconds > 60) {: the condition has length > 1 and only the
#> first element will be used
#> Warning in if (remainingSeconds < 10) {: the condition has length > 1 and only
#> the first element will be used
#> # A tibble: 4 x 2
#> timeInSeconds displayTime
#> <dbl> <chr>
#> 1 2710. 45:10
#> 2 2705. 45:5
#> 3 2692. 44:52
#> 4 2708. 45:8
convertToDisplayTime(2710.46)
#> [1] "45:10"
convertToDisplayTime(2705.04)
#> [1] "45:05"
convertToDisplayTime(2691.66)
#> [1] "44:52"
convertToDisplayTime(2708.10)
#> [1] "45:08"
Created on 2022-01-06 by the reprex package (v2.0.1)
As mentioned in the comments, the problem here is that your function is not vectorized: it works with a single value as input and outputs a single value. However, this does not work when the input is a vector of values, hence the "the condition has length > 1" warning you get:
1: Problem with `mutate()` column `displayTime`.
ℹ `displayTime = convertToDisplayTime(timeInSeconds)`.
ℹ the condition has length > 1 and only the first element will be used
Here, when you use dplyr::mutate, you're technically trying to feed a vector to your function, which is not designed to process one.
Several options you may consider:
1. The "fast and ugly" way:
df <- data.frame(timeInSeconds = c(2710.46, 2705.04, 2691.66, 2708.10))
## This one does not work
df %>% mutate(displayTime = convertToDisplayTime(timeInSeconds))
## This one works
df %>%
  rowwise() %>%
  mutate(displayTime = convertToDisplayTime(timeInSeconds)) %>%
  ungroup()
dplyr::rowwise() allows dplyr::mutate() to work on each row independently, rather than on whole columns. I assume this is the behavior you initially expected. dplyr::ungroup() reverts rowwise(), i.e. goes back to the default column-wise behavior.
I may be a little harsh on this one, but this is the kind of trick that I used back when I did not quite understand my way around dataframes and their manipulation...
2. Vectorize directly from your dplyr verbs:
df %>%
  mutate(displayTime = base::mapply(convertToDisplayTime, timeInSeconds))
## or
df %>%
  mutate(displayTime = purrr::map_chr(timeInSeconds, convertToDisplayTime))
Both options are similar.
3. Vectorize your function:
convertToDisplayTime_vec <- base::Vectorize(convertToDisplayTime)
# class(convertToDisplayTime_vec)
df %>% mutate(displayTime = convertToDisplayTime_vec(timeInSeconds))

## or
convertToDisplayTime_vec2 <- function(timeInSeconds_vec) {
  mapply(FUN = convertToDisplayTime, timeInSeconds_vec)
}
# class(convertToDisplayTime_vec2)
df %>%
  mutate(displayTime = convertToDisplayTime_vec2(timeInSeconds))

# Still works on single variables!
# convertToDisplayTime_vec2(6475)
This is my favourite option: once it is implemented, you can use it on single values, vectors or dataframes without worrying about it.
A little documentation to dig deeper into the subject.
PS: As an aside, a tip worth remembering: you may want to be careful when manipulating data.frame and tibble objects. Despite their similarity, they have slight differences, and some functions deal differently with one or the other, or actually convert one to the other without you noticing...
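One concrete example of such a difference (not from the original answer, just an illustration): single-bracket column extraction drops to a vector for a data.frame but keeps a one-column tibble for a tibble:

library(tibble)
data.frame(a = 1:3)[, "a"]  # returns a plain integer vector
tibble(a = 1:3)[, "a"]      # returns a 1-column tibble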

Report extra information from a test_that block when failing

I want to cat() some information to the console in case a test fails (I'm getting confident this won't happen, but I can't prove it won't) so I can investigate the issue.
Now I have code that is approximately like this:
testthat::test_that('Maybe fails', {
  seed <- as.integer(Sys.time())
  set.seed(seed)
  testthat::expect_true(maybe_fails(runif(100L)))
  testthat::expect_equal(long_vector(runif(100L)), target, tol = 1e-8)
  if (failed()) {
    cat('seed: ', seed, '\n')
  }
})
Unfortunately, failed() doesn't exist.
Return values of expect_*() don't seem useful; they just return the actual argument.
I'm considering just checking again using all.equal(), but that is a pretty ugly duplication.
Instead of using cat, you could use the info argument managed by testthat and its reporters for all expect functions (argument kept for compatibility reasons):
library(testthat)
testthat::test_that("Some tests", {
  testthat::expect_equal(1, 2, info = paste('Test 1 failed at', Sys.time()))
  testthat::expect_equal(1, 1, info = paste('Test 2 failed at', Sys.time()))
})
#> -- Failure (<text>:5:3): Some tests --------------------------------------------
#> 1 not equal to 2.
#> 1/1 mismatches
#> [1] 1 - 2 == -1
#> Test 1 failed at 2021-03-03 17:25:37
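Applied to the question's example, the seed could be reported the same way (maybe_fails, long_vector and target are placeholders taken from the question):

testthat::test_that('Maybe fails', {
  seed <- as.integer(Sys.time())
  set.seed(seed)
  testthat::expect_true(maybe_fails(runif(100L)), info = paste('seed:', seed))
  testthat::expect_equal(long_vector(runif(100L)), target, tol = 1e-8,
                         info = paste('seed:', seed))
})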

How do I use rvest to sort text into different columns?

I am using rvest to (try to) scrape all the author affiliation data from a database of academic publications called RePEc. I have the authors' short IDs, which I'm using to scrape affiliation data. However, each time I try, it gives me the 404 error: Error in open.connection(x, "rb") : HTTP error 404
It must be an issue with my use of sapply because when I test it using an individual ID, it works. Here is the code I'm using:
df$author_reg <- c("paa6","paa2","paa1", "paa8", "pve266", "pya500")
df$websites <- paste0("https://ideas.repec.org/e/", df$author_reg, ".html")
df$affiliation <- sapply(df$websites, function(x) try(x %>% read_html %>% html_nodes("#affiliation h3") %>% html_text()))
I actually need to do this for six columns of authors, and there are NA values I'd like to skip, so if anyone knows how to do that as well, I would be enormously grateful (but not a big deal if not). Thank you in advance for your help!
EDIT: I have just discovered that the error is in the formula for the websites. Sometimes it should be df$websites <- paste0("https://ideas.repec.org/e/", df$author_reg, ".html") and sometimes it should be df$websites <- paste0("https://ideas.repec.org/f/", df$author_reg, ".html")
Does anyone know how to get R to try both and give me the one that works?
You can build both links and use try() on each of them. I am assuming only one of them gives a valid website; otherwise we can always edit the code to keep everything that works:
library(rvest)
library(purrr)

df <- data.frame(id = 1:6)
df$author_reg <- c("paa6", "paa2", "paa1", "paa8", "pve266", "pya500")

http1 <- "https://ideas.repec.org/e/"
http2 <- "https://ideas.repec.org/f/"

df$affiliation <- sapply(df$author_reg, function(x) {
  links <- c(paste0(http1, x, ".html"), paste0(http2, x, ".html"))
  # here we try both links and store under attempt
  attempts <- links %>% map(function(i) {
    try(read_html(i) %>% html_nodes("#affiliation h3") %>% html_text())
  })
  # the good ones will have "character" class, the failed ones, try-error
  gdlink <- which(sapply(attempts, class) != "try-error")
  if (length(gdlink) > 0) {
    return(attempts[[gdlink[1]]])
  } else {
    return("True 404 error")
  }
})
Check the results:
df
id author_reg
1 1 paa6
2 2 paa2
3 3 paa1
4 4 paa8
5 5 pve266
6 6 pya500
affiliation
1 Statistisk SentralbyråGovernment of Norway
2 Department of EconomicsCollege of BusinessUniversity of Wyoming
3 (80%) Institutt for ØkonomiUniversitetet i Bergen, (20%) Gruppe for trygdeøkonomiInstitutt for ØkonomiUniversitetet i Bergen
4 Centraal Planbureau (CPB)Government of the Netherlands
5 Department of FinanceRotterdam School of Management (RSM Erasmus University)Erasmus Universiteit Rotterdam
6 Business SchoolSwinburne University of Technology
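Regarding the NA values mentioned in the question, a small guard at the top of the anonymous function should be enough to skip them; a sketch reusing http1, http2 and the libraries from the code above (returning NA_character_ for those rows is just one choice):

df$affiliation <- sapply(df$author_reg, function(x) {
  if (is.na(x)) return(NA_character_)  # skip missing author IDs
  links <- c(paste0(http1, x, ".html"), paste0(http2, x, ".html"))
  attempts <- links %>% map(function(i) {
    try(read_html(i) %>% html_nodes("#affiliation h3") %>% html_text())
  })
  gdlink <- which(sapply(attempts, class) != "try-error")
  if (length(gdlink) > 0) attempts[[gdlink[1]]] else NA_character_
})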
