Error: in [Seize]->Timeout->[Release]: Expecting a single value: [extent=11]

I'm using R simmer to do a simulation. However, I receive this error message every time I run it:
Error: 'truck0' at 48.73 in [Seize]->Timeout->[Release]: Expecting a
single value: [extent=11].
What is wrong with this?
This is my R script:
rm(list=ls())
#load packages
library(simmer)
library(simmer.plot)
#create a simulation environment
env <- simmer("Terminal")
env
#create a truck trajectory
truck <- trajectory("Truck path", verbose = TRUE)
truck
#draw model
truck %>%
  seize("frontdesk", 1) %>%
  timeout(function() rnorm(11.27671, 3.233562)) %>%
  release("frontdesk", 1) %>%
  seize("gate-in", 1) %>%
  timeout(function() rnorm(17.54509, 9.915719)) %>%
  release("gate-in", 1) %>%
  seize("station", 1) %>%
  timeout(function() rnorm(12.68418, 12.55247)) %>%
  release("station", 1) %>%
  seize("lashing", 1) %>%
  timeout(function() rnorm(28.87726, 21.0809)) %>%
  release("lashing", 1) %>%
  seize("control", 1) %>%
  timeout(function() rnorm(12.70417, 3.711475)) %>%
  release("control", 1) %>%
  seize("frontdesk end", 1) %>%
  timeout(function() rnorm(11.27671, 3.233562)) %>%
  release("frontdesk end", 1)
env <- lapply(1:100, function(i) {
  simmer("Terminal") %>%
    add_resource("frontdesk", 2) %>%
    add_resource("gate-in", 2) %>%
    add_resource("station", 1) %>%
    add_resource("lashing", 15) %>%
    add_resource("control", 1) %>%
    add_resource("frontdesk end", 2) %>%
    add_generator(name = "truck",
                  trajectory = truck,
                  distribution = function() rnorm(1, 24.992, 36.015)) %>%
    run(660) %>%
    wrap()
})

As the error indicates, timeout activities expect a single value, but you are providing 11 of them in this case, because of this:
timeout(function() rnorm(11.27671,3.233562))
rnorm's first parameter is the number of samples (which is rounded to 11 in this case). What are you trying to do here? If that's supposed to be mean=11.27, sd=3.23, then you need to add the number of samples as the first argument:
timeout(function() rnorm(1, 11.27671,3.233562))
so that you get a single sample per call, as required. And the same applies for all the other timeouts.
EDIT: Also, I do not recommend using a normal distribution for service times, because a normal distribution may return negative values (that are by default coerced to positive), and thus you may get unexpected results.
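For instance, a minimal sketch (not part of the original answer) of two ways to keep the service times non-negative:

timeout(function() max(0, rnorm(1, 11.27671, 3.233562)))  # truncate normal draws at zero

# or draw from a distribution with non-negative support, e.g. a gamma;
# shape/rate below are roughly moment-matched to mean 11.28, sd 3.23
timeout(function() rgamma(1, shape = 12.2, rate = 1.08))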

Related

R Function defined in global environment not visible in subsequent function

I have defined four functions. I have executed the code for each and all four appear in the global environment when I call ls().
The first two are used inside the third and this works as expected. However, when I call the third function from the fourth function I get an error message telling me that curent_month doesn't exist.
(I eliminated all code from the fourth function as the failure occurs at the first statement, so the rest is not relevant.)
I have always understood that any object defined in the global environment is available to any sub-environment (i.e., inside a function).
Can anyone point me in the right direction?
## Function returns the most recent month having billing revenues
current_month_POSIX <- function(x){
  ## Fetch current month name for use in label below
  current_month_POSIX <- x %>%
    filter(Year == 2020) %>%
    filter(!is.na(Billing)) %>%
    select(Month) %>%
    unique() %>%
    arrange() %>%
    tail(1) %>%
    unlist() %>%
    as_datetime()
  return(current_month_POSIX)
}

current_month_name <- function(x){
  current_month_name <- x %>%
    filter(Year == 2020) %>%
    filter(!is.na(Billing)) %>%
    select(Month, month_name) %>%
    unique() %>%
    arrange() %>%
    tail(1) %>%
    select(month_name) %>%
    substr(., 1, 3)
  return(current_month_name)
}

curent_month <- function(x){
  POSIX <- current_month_POSIX(x)
  name <- current_month_name(x)
  return(list("current_month_name" = name, "current_month_POSIX" = POSIX))
}

### Function to reduce source data to clustered bar chart table
clustered_bar_data <- function(x){
  latest_month <- current_month(x)
}
current_month does not exist! You named your function curent_month.
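A minimal sketch of the fix, assuming you want to keep the name current_month: either correct the typo in the definition, or call the name you actually defined.

## Option 1: fix the typo in the definition
current_month <- function(x){
  POSIX <- current_month_POSIX(x)
  name <- current_month_name(x)
  return(list("current_month_name" = name, "current_month_POSIX" = POSIX))
}

## Option 2: call the misspelled name as-is
clustered_bar_data <- function(x){
  latest_month <- curent_month(x)
}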

Error while using unnest_tokens() when passing a function as the token

Error in unnest_tokens.data.frame(., entity, text, token = tokenize_scispacy_entities, :
Expected output of tokenizing function to be a list of length 100
unnest_tokens() works well on a sample of a few observations but fails on the entire dataset.
https://github.com/dgrtwo/cord19
Reproducible example:
library(dplyr)
library(cord19)
library(tidyverse)
library(tidytext)
library(spacyr)
Install the model from here - https://github.com/allenai/scispacy
spacy_initialize("en_core_sci_sm")
tokenize_scispacy_entities <- function(text) {
  spacy_extract_entity(text) %>%
    group_by(doc_id) %>%
    nest() %>%
    pull(data) %>%
    map("text") %>%
    map(str_to_lower)
}

paragraph_entities <- cord19_paragraphs %>%
  select(paper_id, text) %>%
  sample_n(10) %>%
  unnest_tokens(entity, text, token = tokenize_scispacy_entities)
I faced the same problem. I don't know the exact reason why, but after I filter out empty abstracts and very short abstract strings, everything seems to work just fine.
abstract_entities <- article_data %>%
  filter(nchar(abstract) > 30) %>%
  select(paper_id, title, abstract) %>%
  sample_n(1000) %>%
  unnest_tokens(entity, abstract, token = tokenize_scispacy_entities)
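The same filter should work for the original cord19 example as well (a sketch, under the assumption that empty or very short paragraphs are what breaks the tokenizer):

paragraph_entities <- cord19_paragraphs %>%
  filter(nchar(text) > 30) %>%   # drop empty and very short paragraphs first
  select(paper_id, text) %>%
  sample_n(10) %>%
  unnest_tokens(entity, text, token = tokenize_scispacy_entities)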

Scraping pages with inconsistent lengths in dataframe

I want to scrape all the names from this page, with the result being one tibble of three columns. My code only works if all the data is there, hence my error:
Error: Tibble columns must have consistent lengths, only values of length one are recycled:
* Length 20: Columns `huisarts`, `url`
* Length 21: Column `praktijk`
How can I let my code run but fill the tibble with NAs where the data isn't there?
My code for a pausing robot (pauzing_robot), used later in the scraper function:
pauzing_robot <- function(periods = c(0, 1)) {
  tictoc <- runif(1, periods[1], periods[2])
  cat(paste0(Sys.time()),
      "- Sleeping for ", round(tictoc, 2), "seconds\n")
  Sys.sleep(tictoc)
}
Scraper:
library(tidyverse)
library(rvest)
scrape_page <- function(pagina_nummer) {
  page <- read_html(paste0("https://www.zorgkaartnederland.nl/huisarts/pagina", pagina_nummer))
  pauzing_robot(periods = c(0, 1.5))
  tibble(
    huisarts = page %>%
      html_nodes(".media-heading.title.orange") %>%
      html_text() %>%
      str_trim(),
    praktijk = page %>%
      html_nodes(".location") %>%
      html_text() %>%
      str_trim(),
    url = page %>%
      html_nodes(".media-heading.title.orange") %>%
      html_nodes("a") %>%
      html_attr("href") %>%
      str_trim() %>%
      paste0("https://www.zorgkaartnederland.nl", .)
  )
}
The total number of pages is 445, but for example's sake I am only scraping three:
huisartsen <- map_df(sample(1:3), scrape_page)
Page 2 seems to be the problem with inconsistent lengths because this code works:
huisartsen <- map_df(3:4, scrape_page)
If possible with tidyverse code. Thanks in advance.
You need to retrieve the list of parent nodes:
parents <- page %>% html_nodes("li.media")
Then parse each parent node with html_node():
tibble(
  huisarts = parents %>%
    html_node(".media-heading.title.orange") %>%
    html_text() %>%
    str_trim(),
  praktijk = parents %>%
    html_node(".location") %>%
    html_text() %>%
    str_trim(),
  url = parents %>%
    html_node(".media-heading.title.orange a") %>%
    html_attr("href") %>%
    str_trim() %>%
    paste0("https://www.zorgkaartnederland.nl", .)
)
The html_node function always returns a value for every parent node, even if it is just NA, so all the columns come out the same length.
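Putting the pieces together, a sketch of the revised scrape_page() (same selectors as above, reusing the pauzing_robot() helper from the question):

scrape_page <- function(pagina_nummer) {
  page <- read_html(paste0("https://www.zorgkaartnederland.nl/huisarts/pagina", pagina_nummer))
  pauzing_robot(periods = c(0, 1.5))
  # one parent node per listing; html_node() returns NA for missing children,
  # so all three columns have the same length
  parents <- page %>% html_nodes("li.media")
  tibble(
    huisarts = parents %>%
      html_node(".media-heading.title.orange") %>%
      html_text() %>%
      str_trim(),
    praktijk = parents %>%
      html_node(".location") %>%
      html_text() %>%
      str_trim(),
    url = parents %>%
      html_node(".media-heading.title.orange a") %>%
      html_attr("href") %>%
      str_trim() %>%
      paste0("https://www.zorgkaartnederland.nl", .)
  )
}

huisartsen <- map_df(1:3, scrape_page)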

Printing intermediate results without breaking pipeline in tidyverse

Is there a command I can add to a tidyverse pipeline that does not break the flow, but produces some side effect, like printing something out? The use case I have in mind is something like this. Given a pipeline
data %>%
  mutate(new_var = <some time consuming operation>) %>%
  mutate(new_var2 = <some other time consuming operation>) %>%
  ...
I would like to add some command to the pipeline that would not modify the end result, but would print out some progress or the state of things. Maybe something like this:
data %>%
  mutate(new_var = <some time consuming operation>) %>%
  command_x(print("first operation done")) %>%
  mutate(new_var2 = <some other time consuming operation>) %>%
  ...
Does there exist such command_x already?
For the specific case of printing an intermediate step in the pipeline, just use %>% print() %>%. E.g.,
mtcars %>%
  filter(cyl == 4) %>%
  print() %>%
  summarise(mpg = mean(mpg))
For a simple status message, you'd do:
pipe_message <- function(.data, status) { message(status); .data }

mtcars %>%
  filter(cyl == 4) %>%
  pipe_message("first operation done") %>%
  select(cyl)
See the answer by @MrFlick for a more general solution for non-print functions.
You could easily write your own function
pass_through <- function(data, fun) {fun(data); data}
And use it like
mtcars %>% pass_through(. %>% ncol %>% print) %>% nrow
Here we use the . %>% syntax to create an anonymous function. You could also write your own more explicitly with
mtcars %>% pass_through(function(x) print(ncol(x))) %>% nrow
You can do it on the fly with an anonymous function:
mtcars %>% ( function(x){print(x); return(x)} ) %>% nrow()

rvest web content scraping issue / car trading website

Question
I wanted to use rvest to scrape specific parts of the website (a car sales platform).
The CSS is frankly too confusing for me to figure out what's wrong on my own.
#### scraping the website www.otomoto.pl with used cars #####
baseURL_otomoto = "https://www.otomoto.pl/osobowe/?page="
i <- 1
for (i in 1:7000) {
  link = paste0(baseURL_otomoto, i)
  out = read_html(link)
  print(i)
  print(link)
  ### building year
  build_year = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[1]') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()
  mileage = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[2]') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()
  volume = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[3]') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()
  fuel_type = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[4]') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()
  price = html_nodes(out, xpath = '//div[@class="offer-item__price"]') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()
  link = html_nodes(out, xpath = '//div[@class="offer-item__title"]') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()
  offer_details = html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul') %>%
    html_text() %>%
    str_replace_all("\n", "") %>%
    str_replace_all("\r", "") %>%
    str_trim()
}
Any guesses what might be the reason for this behaviour?
PS#1.
How do I rvest all build_year, mileage and fuel_type data from the offers available on the analysed website at once, as a data.frame? Using classes (xpath = '//div[@class=...') didn't work in my case.
PS#2.
I wanted to rvest details of the actual offers using, for instance:
gear_type = html_nodes(out, xpath = '//*[@id="parameters"]/ul[1]/li[10]/div') %>%
  html_text() %>%
  str_replace_all("\n", "") %>%
  str_replace_all("\r", "") %>%
  str_trim()
The arguments in ul[a] run over a in 1:2 and in li[b] over b in 1:12.
Unfortunately though this concept fails as the resulting data frame is empty. Any guesses why?
First and foremost, learn about CSS selectors and XPath. Your selectors are very long and extremely fragile (some of them did not work for me at all a mere two weeks later). For example, instead of:
html_nodes(out, xpath = '//*[@id="body-container"]/div[2]/div[1]/div/div[6]/div[2]/article[1]/div[2]/div[3]/ul/li[1]') %>%
  html_text()
you can write:
html_nodes(out, css="[data-code=year]") %>% html_text()
Second, read the documentation of the libraries that you use. str_replace_all's pattern may be a regular expression, which saves you one call (use str_replace_all("[\n\r]", "") instead of str_replace_all("\n", "") %>% str_replace_all("\r", "")). html_text can do the trimming for you, which means that str_trim() is not needed at all.
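For example, a single cleaning step like this (a sketch using the data-code selector shown above) replaces the three-call chain from the question:

build_year <- html_nodes(out, css = "[data-code=year]") %>%
  html_text(trim = TRUE) %>%       # html_text() trims for you, no str_trim() needed
  str_replace_all("[\n\r]", "")    # one regex instead of two str_replace_all() calls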
Third, if you find yourself copy-pasting code, step back and think whether a function wouldn't be a better solution; usually it would. In your case, personally, I would probably skip the str_replace_all calls until the data cleaning step, and call them on the data.frame holding the entire scraped data.
To create a data.frame from your data, call the data.frame() function with column names and content, like this:
data.frame(build_year = build_year,
           mileage = mileage,
           volume = volume,
           fuel_type = fuel_type,
           price = price,
           link = link,
           offer_details = offer_details)
Or you could initialize the data.frame with one column only and then add further vectors as columns:
output_df <- data.frame(build_year = html_nodes(out, css = "[data-code=year]") %>% html_text(TRUE))
output_df$volume <- html_nodes(out, css = "[data-code=engine_capacity]") %>%
  html_text(TRUE)
Finally, you should note that data.frame columns must all be the same length, while some of the data that you scrape is optional. At the moment of writing this answer there were a few offers without engine capacity and without offer description. You have to use two html_nodes calls in succession (as a single CSS selector will not match what doesn't exist). But even then, html_nodes will silently drop missing data. This can be worked around by piping the html_nodes output into an html_node call:
current_df$volume = out %>% html_nodes("ul.offer-item__params") %>%
  html_node("[data-code=engine_capacity]") %>%
  html_text(TRUE)
The final version of my approach to the loop internals is below. Just make sure that you initialize an empty data.frame before calling it and that you merge the output of the current iteration with the final data frame (using for example rbind), or each iteration will overwrite the results of the previous one. Alternatively, use do.call(rbind, lapply()), which is idiomatic R for such a task; a sketch of that outer wrapper follows the code below.
As a side note, when scraping a large amount of quickly changing data, consider decoupling the data downloading and data processing steps. Imagine that there is some corner case you haven't accounted for which causes R to terminate. How will you proceed if such a condition appears in the middle of your iterations? The longer the scrape runs, the more duplicates you introduce (as more offers appear and existing ones are pushed down onto further pages), and the more offers you miss (as sales are concluded and offers disappear forever).
current_df <- data.frame(build_year = html_nodes(out, css = "[data-code=year]") %>% html_text(TRUE))
current_df$mileage = html_nodes(out, css = "[data-code=mileage]") %>%
  html_text(TRUE)
current_df$volume = out %>% html_nodes("ul.offer-item__params") %>%
  html_node("[data-code=engine_capacity]") %>%
  html_text(TRUE)
current_df$fuel_type = html_nodes(out, css = "[data-code=fuel_type]") %>%
  html_text(TRUE)
current_df$price = out %>% html_nodes(xpath = "//div[@class='offer-price']//span[contains(@class, 'number')]") %>%
  html_text(TRUE)
current_df$link = out %>% html_nodes(css = "div.offer-item__title h2 > a") %>%
  html_text(TRUE) %>%
  str_replace_all("[\n\r]", "")
current_df$offer_details = out %>% html_nodes("div.offer-item__title") %>%
  html_node("h3") %>%
  html_text(TRUE)
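As an illustration of that outer wrapper (a sketch only; scrape_offers_page() is a hypothetical helper wrapping the loop internals shown above):

scrape_offers_page <- function(i) {
  out <- read_html(paste0("https://www.otomoto.pl/osobowe/?page=", i))
  current_df <- data.frame(build_year = html_nodes(out, css = "[data-code=year]") %>%
                             html_text(TRUE))
  # ... fill in the remaining columns exactly as shown above ...
  current_df
}

# grow the final data frame with rbind...
output_df <- data.frame()
for (i in 1:10) {
  output_df <- rbind(output_df, scrape_offers_page(i))
}

# ...or, more idiomatically:
output_df <- do.call(rbind, lapply(1:10, scrape_offers_page))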
