Using simmer I get an error when I use add_dataframe with an attribute column:
library(simmer)
workerCount <- 2
actualData <- data.frame(
time = c(1:10,1:5), priority = 1:3, service = rnorm(15, 50, 5)) %>%
dplyr::arrange(time)
actualData$gender<-floor(runif(15, min=1, max=3))
activityTraj <- trajectory() %>%
seize('worker') %>%
timeout_from_attribute("service") %>%
release('worker')
env <- simmer() %>%
add_resource('worker', workerCount, Inf, preemptive = TRUE) %>%
add_dataframe('worker_', activityTraj, actualData, mon=2, col_time="time", time="absolute", col_attributes=c("gender")) %>%
run()
The error I get is:
Error: 'worker_0' at 1.00 in [Seize]->Timeout->[Release]:
missing value (NA or NaN returned)
I'm using simmer_4.3.0
Thanks for any suggestions
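You are specifying just col_attributes=c("gender"), so only gender is attached as an attribute and the service column is ignored. timeout_from_attribute("service") then reads a missing value, hence the NA error. Either include it, col_attributes=c("gender", "service"), or drop col_attributes so that all remaining columns become attributes.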
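A minimal sketch of the corrected call, assuming the same trajectory and data layout as in the question. The seed and the use of sort() instead of dplyr::arrange() are choices made for this illustration only.

```r
library(simmer)

set.seed(1)  # hypothetical seed, only so the sketch is repeatable

# Same shape as the question's data: absolute arrival times must be sorted
actualData <- data.frame(
  time = sort(c(1:10, 1:5)),
  priority = 1:3,
  service = rnorm(15, 50, 5)
)
actualData$gender <- floor(runif(15, min = 1, max = 3))

activityTraj <- trajectory() %>%
  seize("worker") %>%
  timeout_from_attribute("service") %>%
  release("worker")

env <- simmer() %>%
  add_resource("worker", 2, Inf, preemptive = TRUE) %>%
  # list every column the trajectory reads as an attribute, not just gender
  add_dataframe("worker_", activityTraj, actualData, mon = 2,
                col_time = "time", time = "absolute",
                col_attributes = c("gender", "service")) %>%
  run()

arrivals <- get_mon_arrivals(env)
```

With service listed as an attribute, the timeout has a value to read for every arrival, so the run should complete without the NA error.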
Given the following code
library(tidyverse)
library(lubridate)
library(tidymodels)
library(ranger)
df <- read_csv("https://raw.githubusercontent.com/norhther/datasets/main/bitcoin.csv")
df <- df %>%
mutate(Date = dmy(Date),
Change_Percent = str_replace(Change_Percent, "%", ""),
Change_Percent = as.double(Change_Percent)
) %>%
filter(year(Date) > 2017)
int <- interval(ymd("2020-01-20"),
ymd("2022-01-15"))
df <- df %>%
mutate(covid = ifelse(Date %within% int, T, F))
df %>%
ggplot(aes(x = Date, y = Price, color = covid)) +
geom_line()
df <- df %>%
arrange(Date) %>%
mutate(lag1 = lag(Price),
lag2 = lag(lag1),
lag3 = lag(lag2),
profit_next_day = lead(Profit))
# modeling
df_mod <- df %>%
select(-covid, -Date, -Vol_K, -Profit) %>%
mutate(profit_next_day = as.factor(profit_next_day))
set.seed(42)
data_split <- initial_split(df_mod) # 3/4
train_data <- training(data_split)
test_data <- testing(data_split)
bitcoin_rec <-
recipe(profit_next_day ~ ., data = train_data) %>%
step_naomit(all_outcomes(), all_predictors()) %>%
step_normalize(all_numeric_predictors())
bitcoin_prep <-
prep(bitcoin_rec)
bitcoin_train <- juice(bitcoin_prep)
bitcoin_test <- bake(bitcoin_prep, test_data)
rf_spec <-
rand_forest(trees = 200) %>%
set_engine("ranger", importance = "impurity") %>%
set_mode("classification")
bitcoin_wflow <-
workflow() %>%
add_model(rf_spec) %>%
add_recipe(bitcoin_prep)
bitcoin_fit <-
bitcoin_wflow %>%
fit(data = train_data)
final_model <- last_fit(bitcoin_wflow, data_split)
collect_metrics(final_model)
final_model %>%
extract_workflow() %>%
predict(test_data)
The last chunk of code that extracts the workflow and predicts the test_data is throwing the error:
Error in stop_subscript(): ! Can't subset columns that don't exist.
x Column profit_next_day doesn't exist.
but profit_next_day exists already in test_data, as I checked multiple times, so I don't know what is happening. Never had this error before working with tidymodels.
The problem here comes from using step_naomit() on the outcome. In general, steps that change rows (such as removing them) can be pretty tricky when it comes time to resample or predict on new data. You can read more in detail in our book, but I would suggest that you remove step_naomit() altogether from your recipe and change your earlier code to:
df_mod <- df %>%
select(-covid, -Date, -Vol_K, -Profit) %>%
mutate(profit_next_day = as.factor(profit_next_day)) %>%
na.omit()
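As a rough illustration of where the missing values come from, here is a base-R sketch on made-up data (the values and the five-row data frame are hypothetical, not taken from the bitcoin file). Lagging leaves an NA in the first row and leading leaves one in the last row, so dropping them once, before the split, means the recipe never has to remove rows.

```r
# Hypothetical toy data standing in for the bitcoin data frame
df <- data.frame(Price  = c(10, 12, 11, 15, 14),
                 Profit = c(TRUE, FALSE, TRUE, FALSE, TRUE))
n <- nrow(df)

df$lag1 <- c(NA, df$Price[-n])              # base-R equivalent of dplyr::lag(Price)
df$profit_next_day <- c(df$Profit[-1], NA)  # base-R equivalent of dplyr::lead(Profit)

# Drop incomplete rows up front instead of inside the recipe
df_mod <- na.omit(df[, c("Price", "lag1", "profit_next_day")])
df_mod$profit_next_day <- as.factor(df_mod$profit_next_day)

nrow(df_mod)
```

Only the first and last rows are dropped here; the remaining rows are complete, so training and test sets built from df_mod contain no missing outcome.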
Not sure if you all will be able to help me without reproducible example data, but I have a problem with running the code below. I am attempting to use the multidplyr package, but it doesn't seem to find my columns. I am running the code below:
cl <- detectCores()
cl
models_prep <-
bookings_prep %>%
inner_join(pipeline_prep_, by = c("booking_type", "group")) %>%
crossing(biz_day) %>%
left_join(closed_pipeline, by = c("booking_type", "group")) %>%
select(-opportunity_forecast_category)
group1 <- rep(1:cl, length.out = nrow(models_prep))
models_prep1 <- bind_cols(tibble(group1), models_prep)
cluster <- new_cluster(cl)
cluster %>%
cluster_library("tidyr")
cluster %>%
cluster_library("purrr")
cluster %>%
cluster_library("plyr")
cluster %>%
cluster_library("dplyr")
cluster_copy(cluster, "rmf")
cluster_copy(cluster, "fc_xreg")
#cluster_assign(cluster, "rmf")
#cluster_copy(cluster,c("rmf","fc_xreg"))
by_group <- models_prep %>%
group_by(group) %>%
partition(cluster)
by_group1 <- models_prep1 %>%
group_by(group1) %>%
partition(cluster)
models <- by_group %>%
mutate(
xreg_arima = pmap(list(data = pipeline, h = 1,name = group, bookings = bookings, type = booking_type,
biz_day = biz_day, no_bookings = no_bookings,
sparse_pipeline = sparse_pipeline,
closed_forecast_cat = pipeline_amount, FUN = "fc_xreg"), rmf))
Everything runs up to models <- correctly, but it fails there saying it cannot find the object group. Here is what the by_group data frame looks like.
Sometimes arguments just need to be quoted, particularly in dplyr-ish situations.
models <- by_group %>%
mutate(
xreg_arima = pmap(list(data = pipeline, h = 1,name = "group", bookings = "bookings", type = "booking_type",
biz_day = "biz_day", no_bookings = "no_bookings",
sparse_pipeline = "sparse_pipeline",
closed_forecast_cat = "pipeline_amount", FUN = "fc_xreg"), rmf))
I have a list of data frames with daily streamflow data.
I want to estimate the maximum daily flow from June to November of every year for each data frame in the list. Each data frame holds the data for one station.
This is how the list of data frames looks:
and this is the code I am using:
#Peak mean daily flow summer and fall (June to November)
PeakflowSummerFall <- lapply(listDF,function(x){x %>% group_by(x %>% mutate(year = year(Date)))
%>% filter((x %>% mutate(month = month(Date)) >= 6) & (x %>% mutate(month = month(Date)) <= 11))
%>% summarise(max=max(DailyStreamflow, na.rm =TRUE))})
but I am having this error:
<error/dplyr_error>
Problem with `filter()` input `..1`.
x Input `..1` must be of size 1, not size 24601.
i Input `..1` is `&...`.
i The error occurred in group 1: Date = 1953-06-01, DailyStreamflow = 32, year = 1953.
Backtrace:
Run `rlang::last_trace()` to see the full context
Any solution to this problem?
#### This should provide you with enough
#### sample data for answerers to work with
install.packages('purrr')
library(purrr)
sample_dat <- listDF %>%
head %>%
map( ~ head(.x))
dput(sample_dat)
#### With that being said...
#### You should flatten the data frame...
#### It's easier to work with...
install.packages('lubridate')
library(lubridate)
library(dplyr)
listDF %>%
plyr::ldply(rbind) %>% # .id is only added when listDF is a named list
mutate(year = year(Date)) %>%
filter(month(Date) > 5, month(Date) < 12) %>% # keep June to November
group_by(.id, year) %>% # one maximum per station and year
dplyr::summarise(max_flow = max(DailyStreamflow, na.rm = TRUE)) %>%
split(.$.id)
Given the posted image of the data structure, the following might work.
library(lubridate)
library(dplyr)
listDF %>%
purrr::map(function(x){
x %>%
filter(month(Date) >= 6 & month(Date) <= 11) %>%
group_by(year(Date)) %>%
summarise(Max = max(DailyStreamflow, na.rm = TRUE), .groups = "keep")
})
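As a cross-check of the filter-then-group logic, the same computation can be sketched in base R on a tiny hand-made list. The toy values and the helper name peak_june_nov are hypothetical. This is not the dplyr code above, only the same idea with format() and split().

```r
# Toy list of two stations (hypothetical values, for illustration only)
toyDF <- list(
  data.frame(
    Date = as.Date(c("1953-05-31", "1953-06-01", "1953-11-30",
                     "1954-07-15", "1954-12-01")),
    DailyStreamflow = c(99, 32, 40, 25, 80)
  ),
  data.frame(
    Date = as.Date(c("1965-06-10", "1965-08-01", "1966-01-01")),
    DailyStreamflow = c(5, NA, 50)
  )
)

peak_june_nov <- function(x) {
  m <- as.integer(format(x$Date, "%m"))
  x <- x[m >= 6 & m <= 11, ]               # keep June to November only
  yr <- format(x$Date, "%Y")
  # one maximum per year, ignoring missing flows
  sapply(split(x$DailyStreamflow, yr), max, na.rm = TRUE)
}

peaks <- lapply(toyDF, peak_june_nov)
peaks
```

The May and December values (99 and 80) fall outside the window and do not affect the result, which is the behaviour the filter on month(Date) is meant to give.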
Test data creation code.
fun <- function(year, n){
d1 <- as.Date(paste(year, 1, 1, sep = "-"))
d2 <- as.Date(paste(year + 10, 12, 31, sep = "-"))
d <- seq(d1, d2, by = "day")
d <- sort(rep(sample(d, n, TRUE), length.out = n))
flow <- sample(10*n, n, TRUE)
data.frame(Date = d, DailyStreamflow = flow)
}
set.seed(2020)
listDF <- lapply(1:3, function(i) fun(c(1953, 1965, 1980)[i], c(24601, 13270, 17761)[i]))
str(listDF)
rm(fun)
I have the following results and I am trying to apply a function within a pipe command.
The code I am using which gives me the error is the following:
sample_rmse_tbl <- dataset %>%
mutate(rmse = map_dbl(predict, calc_rmse)) %>%
select(id, rmse)
Which gives the following error:
Error in mutate_impl(.data, dots) :
Evaluation error: Result 1 is not a length 1 atomic vector.
The data uses the sun spots data and the code that I have is the following (the error I run into is the last line of the code):
I have followed the tutorial carefully and everything works for me up until this line of code.
--- The code is a cut down version of this tutorial: https://www.business-science.io/timeseries-analysis/2018/04/18/keras-lstm-sunspots-time-series-prediction.html
# Core Tidyverse
library(tidyverse)
library(glue)
library(forcats)
# Time Series
library(timetk)
library(tidyquant)
library(tibbletime)
# Visualization
library(cowplot)
# Preprocessing
library(recipes)
# Sampling / Accuracy
library(rsample)
library(yardstick)
# Modeling
library(keras)
sun_spots <- datasets::sunspot.month %>%
tk_tbl() %>%
mutate(index = as_date(index)) %>%
as_tbl_time(index = index)
sun_spots
############################################
periods_train <- 12 * 50
periods_test <- 12 * 10
skip_span <- 12 * 20
rolling_origin_resamples <- rolling_origin(
sun_spots,
initial = periods_train,
assess = periods_test,
cumulative = FALSE,
skip = skip_span
)
rolling_origin_resamples
############################################
calc_rmse <- function(prediction_tbl) {
rmse_calculation <- function(data) {
data %>%
spread(key = key, value = value) %>%
select(-index) %>%
filter(!is.na(predict)) %>%
rename(
truth = actual,
estimate = predict
) %>%
rmse(truth, estimate)
}
safe_rmse <- possibly(rmse_calculation, otherwise = NA)
safe_rmse(prediction_tbl)
}
#############################################
predict_keras_lstm <- function(split, epochs = 300, ...) {
lstm_prediction <- function(split, epochs, ...) {
# 5.1.2 Data Setup
df_trn <- training(split)
df_tst <- testing(split)
df <- bind_rows(
df_trn %>% add_column(key = "training"),
df_tst %>% add_column(key = "testing")
) %>%
as_tbl_time(index = index)
# 5.1.3 Preprocessing
rec_obj <- recipe(value ~ ., df) %>%
step_sqrt(value) %>%
step_center(value) %>%
step_scale(value) %>%
prep()
df_processed_tbl <- bake(rec_obj, df)
center_history <- rec_obj$steps[[2]]$means["value"]
scale_history <- rec_obj$steps[[3]]$sds["value"]
# 5.1.4 LSTM Plan
lag_setting <- 120 # = nrow(df_tst)
batch_size <- 40
train_length <- 440
tsteps <- 1
epochs <- epochs
# 5.1.5 Train/Test Setup
lag_train_tbl <- df_processed_tbl %>%
mutate(value_lag = lag(value, n = lag_setting)) %>%
filter(!is.na(value_lag)) %>%
filter(key == "training") %>%
tail(train_length)
x_train_vec <- lag_train_tbl$value_lag
x_train_arr <- array(data = x_train_vec, dim = c(length(x_train_vec), 1, 1))
y_train_vec <- lag_train_tbl$value
y_train_arr <- array(data = y_train_vec, dim = c(length(y_train_vec), 1))
lag_test_tbl <- df_processed_tbl %>%
mutate(
value_lag = lag(value, n = lag_setting)
) %>%
filter(!is.na(value_lag)) %>%
filter(key == "testing")
x_test_vec <- lag_test_tbl$value_lag
x_test_arr <- array(data = x_test_vec, dim = c(length(x_test_vec), 1, 1))
y_test_vec <- lag_test_tbl$value
y_test_arr <- array(data = y_test_vec, dim = c(length(y_test_vec), 1))
# 5.1.6 LSTM Model
model <- keras_model_sequential()
model %>%
layer_lstm(units = 50,
input_shape = c(tsteps, 1),
batch_size = batch_size,
return_sequences = TRUE,
stateful = TRUE) %>%
layer_lstm(units = 50,
return_sequences = FALSE,
stateful = TRUE) %>%
layer_dense(units = 1)
model %>%
compile(loss = 'mae', optimizer = 'adam')
# 5.1.7 Fitting LSTM
for (i in 1:epochs) {
model %>% fit(x = x_train_arr,
y = y_train_arr,
batch_size = batch_size,
epochs = 1,
verbose = 1,
shuffle = FALSE)
model %>% reset_states()
cat("Epoch: ", i)
}
# 5.1.8 Predict and Return Tidy Data
# Make Predictions
pred_out <- model %>%
predict(x_test_arr, batch_size = batch_size) %>%
.[,1]
# Retransform values
pred_tbl <- tibble(
index = lag_test_tbl$index,
value = (pred_out * scale_history + center_history)^2
)
# Combine actual data with predictions
tbl_1 <- df_trn %>%
add_column(key = "actual")
tbl_2 <- df_tst %>%
add_column(key = "actual")
tbl_3 <- pred_tbl %>%
add_column(key = "predict")
# Create time_bind_rows() to solve dplyr issue
time_bind_rows <- function(data_1, data_2, index) {
index_expr <- enquo(index)
bind_rows(data_1, data_2) %>%
as_tbl_time(index = !! index_expr)
}
ret <- list(tbl_1, tbl_2, tbl_3) %>%
reduce(time_bind_rows, index = index) %>%
arrange(key, index) %>%
mutate(key = as_factor(key))
return(ret)
}
safe_lstm <- possibly(lstm_prediction, otherwise = NA)
safe_lstm(split, epochs, ...)
}
#################################################
sample_predictions_lstm_tbl <- rolling_origin_resamples %>%
mutate(predict = map(splits, predict_keras_lstm, epochs = 10))
sample_predictions_lstm_tbl
sample_predictions_lstm_tbl$predict
map_dbl(sample_predictions_lstm_tbl$predict, calc_rmse)
sample_rmse_tbl <- sample_predictions_lstm_tbl %>%
mutate(rmse = map_dbl(predict, calc_rmse)) %>%
select(id, rmse)
EDIT1:
[[11]]
# A time tibble: 840 x 3
# Index: index
index value key
<date> <dbl> <fct>
1 1949-11-01 144. actual
2 1949-12-01 118. actual
3 1950-01-01 102. actual
4 1950-02-01 94.8 actual
5 1950-03-01 110. actual
6 1950-04-01 113. actual
7 1950-05-01 106. actual
8 1950-06-01 83.6 actual
9 1950-07-01 91 actual
10 1950-08-01 85.2 actual
# ... with 830 more rows
EDIT2:
I came up with a "workaround" but I am getting different results to the article.
temp <- NULL
sample_rmse_tbl <- NULL
for(i in 1:length(sample_predictions_lstm_tbl$predict)){
temp <- calc_rmse(sample_predictions_lstm_tbl$predict[[i]])
sample_rmse_tbl[[i]] <- temp
}
sample_rmse_tbl <- do.call(rbind.data.frame, sample_rmse_tbl)
sample_rmse_tbl %>%
setNames(., c("metric", "estimator", "rmse")) %>%
mutate(id = row_number()) %>%
select(id, rmse)
The problem is that the function rmse() returns a tibble (with columns .metric, .estimator and .estimate) rather than a single double value, so map_dbl() cannot use the result. You need to select the estimate value from it using .$.estimate. However, I had to remove the possibly() call to make my solution work.
So, the new function calc_rmse() looks like this.
calc_rmse <- function(prediction_tbl) {
rmse_calculation <- function(data) {
data %>%
spread(key = key, value = value) %>%
select(-index) %>%
filter(!is.na(predict)) %>%
rename(
truth = actual,
estimate = predict
) %>%
rmse(truth, estimate) %>% .$.estimate
}
rmse_calculation(prediction_tbl)
}
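The length-1 requirement can be reproduced without keras or yardstick. The sketch below uses a hypothetical stand-in, rmse_df(), that mimics the one-row table shape returned by rmse(), and base R's vapply(..., numeric(1)), which enforces the same "one double per element" contract as map_dbl(). The data are made up.

```r
# Hypothetical stand-in for yardstick::rmse(): returns a one-row data frame
rmse_df <- function(truth, estimate) {
  data.frame(.metric = "rmse", .estimator = "standard",
             .estimate = sqrt(mean((truth - estimate)^2)))
}

# Made-up prediction tables, one per resample
preds <- list(
  data.frame(truth = c(1, 2, 3), estimate = c(1, 2, 5)),
  data.frame(truth = c(0, 0),    estimate = c(3, 4))
)

# Returning the whole table violates the length-1 contract and errors
whole_table <- try(
  vapply(preds, function(d) rmse_df(d$truth, d$estimate), numeric(1)),
  silent = TRUE
)
failed <- inherits(whole_table, "try-error")

# Pulling out the .estimate column yields one double per element
rmse_values <- vapply(preds,
                      function(d) rmse_df(d$truth, d$estimate)$.estimate,
                      numeric(1))
```

The first call fails for the same reason as the original mutate(), while the second returns a plain numeric vector that can be stored in a column.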
I've got data on actual events and I need to model what might have happened if different resources were available. The next stage will be to build a "proper" simulation where events and times are created more randomly. My problem is that I can't work out how to ensure a specific activity gets assigned the start time, priority and timeout which it had in real life.
library(simmer)
set.seed(654)
env <- simmer()
workerCount <- 2
actualData <- data.frame(arrTime = c(1:10,1:5),
priority = 1:3, duration = rnorm(15, 50, 5))
activityTraj <- trajectory() %>%
seize('worker') %>%
timeout(5) %>%
release('worker')
env %>%
add_resource('worker', workerCount, Inf, preemptive = TRUE) %>%
add_generator('worker', activityTraj, at(actualData$arrTime),
mon = 2, priority = 2)
env %>% run(50)
What I need to do in the above is to make the priority in the generator read from the data frame (currently hard coded at 2) and the timeout (currently hard coded at 5) in the trajectory also read from the data frame. I can't see how I can ensure that the row that specifies the priority and time of the activity will also be used to specify the duration (or "timeout").
First of all, you must ensure that your actualData frame is sorted by arrTime:
actualData <- data.frame(arrTime = c(1:10,1:5),
priority = 1:3,
duration = rnorm(15, 50, 5)) %>%
dplyr::arrange(arrTime)
Then, let's build a helper function to consume the columns of your actualData:
consume <- function(x, prio=FALSE) {
i <- 0
function() {
i <<- i + 1
if (prio) c(x[[i]], x[[i]], FALSE)
else x[[i]]
}
}
which can be applied to your trajectory as follows:
activityTraj <- trajectory() %>%
set_prioritization(consume(actualData$priority, TRUE)) %>%
set_attribute("duration", consume(actualData$duration)) %>%
seize('worker') %>%
timeout(function(attr) attr["duration"]) %>%
release('worker')
because your arrivals are sorted. Finally, let's run the simulation:
env %>%
add_resource('worker', workerCount, Inf, preemptive = TRUE) %>%
add_generator('worker_', activityTraj, at(actualData$arrTime)) %>%
run()
and check that the actual durations were ok:
activity_time <- get_mon_arrivals(env) %>%
tidyr::separate(name, c("prefix", "n"), convert=TRUE) %>%
dplyr::arrange(n) %>%
dplyr::pull(activity_time)
all(activity_time == actualData$duration)
#> TRUE
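To see why the consume() helper above pairs each arrival with its own row, it can be called by hand outside simmer. The input vectors here are made up for illustration. Each closure keeps a private counter, so every call returns the next element, and that is why the data must be sorted in arrival order.

```r
consume <- function(x, prio = FALSE) {
  i <- 0
  function() {
    i <<- i + 1                            # advance the private counter
    if (prio) c(x[[i]], x[[i]], FALSE)     # priority, preemptible, restart
    else x[[i]]
  }
}

next_duration <- consume(c(50.2, 47.9, 53.1))  # hypothetical durations
d1 <- next_duration()   # first call: first element
d2 <- next_duration()   # second call: second element

next_prio <- consume(c(1, 2, 3), prio = TRUE)  # hypothetical priorities
p1 <- next_prio()       # three values, the form set_prioritization() expects
```

Because next_duration and next_prio hold separate counters, the n-th arrival gets the n-th priority and the n-th duration as long as each closure is called exactly once per arrival.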
UPDATE: Since simmer v3.8.0, the new data source add_dataframe greatly simplifies this kind of pattern:
library(simmer)
workerCount <- 2
actualData <- data.frame(
time = c(1:10,1:5), priority = 1:3, service = rnorm(15, 50, 5)) %>%
dplyr::arrange(time)
activityTraj <- trajectory() %>%
seize('worker') %>%
timeout_from_attribute("service") %>%
release('worker')
env <- simmer() %>%
add_resource('worker', workerCount, Inf, preemptive = TRUE) %>%
add_dataframe('worker_', activityTraj, actualData, time="absolute") %>%
run()
activity_time <- get_mon_arrivals(env) %>%
tidyr::separate(name, c("prefix", "n"), convert=TRUE) %>%
dplyr::arrange(n) %>%
dplyr::pull(activity_time)
isTRUE(all.equal(activity_time, actualData$service)) # the column is named service in this version
#> TRUE