I wrote the following code in R
library(fda)
n_curves <- 15951
n_points <- 2537
argvals <- matrix(df_l$Time, nrow = n_points, ncol = n_curves)
y_mat <- matrix(df_l$Curve, nrow = n_points, ncol = n_curves)
W.obj <- Data2fd(argvals = argvals, y = y_mat, basisobj = basis, lambda = 0.5)
But I'm getting an error
Error in if ((a01[1] <= arng[1]) && (arng[2] <= a01[2])) { :
missing value where TRUE/FALSE needed
What does it mean, and how do I prevent it?
I have repeated-measures data and am trying to do functional data analysis. My data contain a lot of missing values (NA), and I suspect the NAs are what triggers the error.
data:
> dput(head(df_l, 30))
structure(list(Time = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30), Curve = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 5, 10, 10, 10, 10, 8, 8, 8, 8,
8, 8)), row.names = c(NA, 30L), class = "data.frame")
> dput(head(basis, 5))
list(call = basisfd(type = type, rangeval = rangeval, nbasis = nbasis,
params = params, dropind = dropind, quadvals = quadvals,
values = values, basisvalues = basisvalues), type = "bspline",
rangeval = c(0, 2537), nbasis = 53, params = c(50.74, 101.48,
152.22, 202.96, 253.7, 304.44, 355.18, 405.92, 456.66, 507.4,
558.14, 608.88, 659.62, 710.36, 761.1, 811.84, 862.58, 913.32,
964.06, 1014.8, 1065.54, 1116.28, 1167.02, 1217.76, 1268.5,
1319.24, 1369.98, 1420.72, 1471.46, 1522.2, 1572.94, 1623.68,
1674.42, 1725.16, 1775.9, 1826.64, 1877.38, 1928.12, 1978.86,
2029.6, 2080.34, 2131.08, 2181.82, 2232.56, 2283.3, 2334.04,
2384.78, 2435.52, 2486.26))
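As a hedged illustration of what the message itself means: base R raises it whenever the condition of an if() evaluates to NA, and range() of a vector containing NA is itself NA unless na.rm = TRUE is set. The sketch below reproduces that with made-up numbers. The names check_range, arg_range, allowed and times are hypothetical, and whether this is exactly what happens inside Data2fd for this data is an assumption, not something the error output confirms.

```r
# Hypothetical range check of the same shape as the condition in the error.
check_range <- function(arg_range, allowed) {
  if ((allowed[1] <= arg_range[1]) && (arg_range[2] <= allowed[2])) {
    "inside"
  } else {
    "outside"
  }
}

allowed <- c(0, 2537)
times <- c(1, 2, NA, 4)   # made-up argument values with one NA

# range() propagates the NA, so the comparison is NA and if() fails.
msg <- tryCatch(
  check_range(range(times), allowed),
  error = function(e) conditionMessage(e)
)
print(msg)

# Dropping the NA before taking the range gives a usable TRUE/FALSE.
ok <- check_range(range(times, na.rm = TRUE), allowed)
print(ok)
```

If this is the cause, checking anyNA() on the argument values and on the curve values before the call would show which input carries the NAs.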
I have paired data as shown below, and I want to estimate the expected value of the within-pair difference in the column called value. In every pair, one sibling has the disease and the other does not, as you can see from the data. In other words, I want the expected difference in value between one sibling and his/her sibling.
The variables in the data are:
id = individual ID
familyID = family ID, shared by the two siblings in a pair
status = 1 means disease and status = 0 means no disease
Any guidance is appreciated.
d <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
familyID = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10),
status = c(0,1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1),
value = c(29,26, 39, 22.3, 24, 41, 29.7, 24, 25.9, 21, 29,24,26,29, 15.2, 11, 35, 15.4,16, 13.4)),
class = c("tbl_df","tbl", "data.frame"), row.names = c(NA, -20L))
I'm not certain if this is what you are looking for, but I used pivot_wider from tidyr to spread the values into two columns: those with status 0 and those with status 1. Then I used mutate to take the difference between the two columns and plotted that difference against familyID with ggplot. Note that I removed the id column so that pivot_wider puts both siblings of a family on one row.
d <- structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20),
familyID = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10),
status = c(0,1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1),
value = c(29,26, 39, 22.3, 24, 41, 29.7, 24, 25.9, 21, 29,24,26,29, 15.2, 11, 35, 15.4,16, 13.4)),
class = c("tbl_df","tbl", "data.frame"), row.names = c(NA, -20L))
library(dplyr)
library(tidyr)
library(ggplot2)
d%>%
select(-id)%>%
pivot_wider(values_from = value, names_from = status)%>%
mutate("Diff" = (`0`-`1`))%>%
ggplot()+
aes(as.character(familyID), Diff)+
geom_point()
You can group by familyID, then use summarize() from the dplyr package to find the differences.
Also note the conversion of id, familyID, and status to factors, which may make life easier because they are then treated as categories rather than numbers.
library(dplyr)
library(forcats)
library(ggplot2)
d <- structure(list(id = as.factor(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)),
familyID = as.factor(c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10)),
status = as.factor(c(0,1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1)),
value = c(29,26, 39, 22.3, 24, 41, 29.7, 24, 25.9, 21, 29,24,26,29, 15.2, 11, 35, 15.4,16, 13.4)),
class = c("tbl_df","tbl", "data.frame"), row.names = c(NA, -20L))
diffs <- group_by(d, familyID) %>%
summarize(., diff = (value[status == 0] - value[status == 1]))
Reordering the families by difference can help get a sense of the distribution of differences
diffs$familyID <- fct_reorder(diffs$familyID, diffs$diff, .desc = TRUE)
ggplot(diffs, aes(x = familyID, y = diff)) +
geom_bar(stat="identity")
If you really have a lot of families you may want to display a summary of the differences.
One option is with a histogram (modifying binwidth can control how fine the bins are):
ggplot(diffs, aes(x = diff)) +
geom_histogram(binwidth = 3)
Similar to a histogram is a density plot:
ggplot(diffs, aes(x = diff)) +
geom_density()
Finally, a boxplot is also a familiar summary. Boxplots are mostly meant for comparing multiple groups, but one works okay with just a single group. I've added the individual points using the geom_jitter() function.
ggplot(diffs, aes(y = diff)) + #If using multiple groups add x=group inside the aes() function.
geom_boxplot() +
geom_jitter(aes(x = 0))
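Since the question asks for the expected value of the within-pair difference, a natural estimate is the mean of the per-family differences. Here is a minimal base R sketch using the same sign convention as the answers above (status 0 minus status 1). The names healthy, diseased, pairs and pair_diff are only illustrative, and t.test() is included as one possible way to attach a confidence interval, assuming the differences are roughly normal.

```r
# Same data as in the question, as a plain data frame.
d <- data.frame(
  familyID = rep(1:10, each = 2),
  status = c(0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1),
  value = c(29, 26, 39, 22.3, 24, 41, 29.7, 24, 25.9, 21,
            29, 24, 26, 29, 15.2, 11, 35, 15.4, 16, 13.4)
)

# Put the two siblings of each family side by side.
healthy <- d[d$status == 0, ]
diseased <- d[d$status == 1, ]
pairs <- merge(healthy, diseased, by = "familyID", suffixes = c("_0", "_1"))

# Within-pair difference: no disease minus disease.
pair_diff <- pairs$value_0 - pairs$value_1
mean_diff <- mean(pair_diff)
print(mean_diff)

# One-sample t-test on the differences (equivalent to a paired t-test).
print(t.test(pair_diff)$conf.int)
```

Flipping the subtraction gives the difference in the other direction (disease minus no disease) with the opposite sign.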
I'm trying to run a logistic regression model in R to identify auxiliary variables which predict missingness in other variables to run a multiple imputation by chained equations model.
Below, var_1 is the missingness variable, computed from another existing variable (0 = not missing, 1 = missing).
For the model, I'm using a logistic regression to predict missingness (0,1) from other variables, and those which are predictive at the p<0.05 level will be used as auxiliary variables which predict missingness.
I'm aware this is not the method normally used to identify auxiliary variables for MICE but this is what I have been advised to do and cannot deviate from this method.
var_1 <- c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
var_2 <- c(14, 13, 16, 12, 11, 16, 8, 13, 14, 16, 11, 15, 13, 15, 15, 7, 15, 14, 14, 7, 16, 14, 12, 16, 12, 16, 12, 12, 11, 13, 16, 12, 13, 13, 12, 14, 12, 16, 10, 14, 16, 14, 16, 16, 12, 8, 15, 14, 14, 14, 12, 15, 10, 12, 10, 13, 14, 16, 11, 7, 14, 9, 15, 14, 13, 9, 16)
var_3 <- c(13, 10, 16, 8, 10, 13, 8, 16, 13, 12, 4, 3, 8, 11, 13, 8, 8, 13, 10, 9, 16, 13, 1, 14, 12, 14, 12, 10, 12, 11, 16, 9, 9, 5, 7, 14, 15, 16, 10, 8, 16, 12, 12, 7, 13, 4, 16, 13, NA, NA, NA, NA, NA, NA, NA, 0, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
var_4 <- c(16, 16, 12, 16, 13, 17, 15, 19, 12, 18, 15, 19, 15, NA, 17, 11, 10, 13, NA, 11, 18, 18, 11, 12, 11, 19, 10, 15, 17, 17, NA, 17, 15, 15, 17, 18, 15, 14, 11, 13, 14, 15, 20, 16, 12, 11, 17, 16, 11, 15, 15, 15, 11, 11, 9, 13, NA, 12, 13, 14, 13, 15, 19, 15, 15, 15, 16)
df <- data.frame(var_1, var_2, var_3, var_4) # glm() needs a data frame, not the matrix cbind() returns
lm_1 <- glm(var_1 ~ var_2, data = df, family = binomial())
broom::tidy(lm_1, conf.int = TRUE)
lm_2 <- glm(var_1 ~ var_3, data = df, family = binomial())
broom::tidy(lm_2, conf.int = TRUE)
lm_3 <- glm(var_1 ~ var_4, data = df, family = binomial())
broom::tidy(lm_3, conf.int = TRUE)
All three glm() calls run, but with broom::tidy(), lm_1 works, lm_2 fails, and lm_3 works (lm_3 is included to show that the model can handle some NA values).
I've figured out that this is most likely because var_1 is 0 up to the 48th observation and 1 from the 49th onwards, except for the 56th, which is 0, whereas var_3 is NA from the 49th observation onwards, except for the 56th, which is 0. This means that, once rows with a missing var_3 are excluded, the model is left with no observations where var_1 is 1.
Is there an alternative model, method, package or function which I can use to achieve my desired outcome? I can see that swapping the coding to 1 = not missing, 0 = missing will probably run, but will it give me the correct result? I realise I'm probably overthinking this; I have been trying to deal with it for a long time.
Thanks for taking the time to read this, I'd love any advice you might have!
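One way to check the diagnosis above is to tabulate the outcome among the rows that survive listwise deletion. The sketch below is base R only. It uses a stand-in for var_3 (a constant 10 wherever the posted value is observed and non-zero) that copies only the missingness pattern of the posted data, so the values are illustrative. It also assumes the default na.action, under which glm() drops rows with NA in any model variable.

```r
# Outcome as posted: 48 zeros, then ones except for observation 56.
var_1 <- c(rep(0, 48), rep(1, 7), 0, rep(1, 11))

# Stand-in predictor with the same NA pattern as the posted var_3.
var_3 <- c(rep(10, 48), rep(NA, 7), 0, rep(NA, 11))

# Rows that would remain after NA rows are dropped.
complete <- !is.na(var_3)
outcome_counts <- table(factor(var_1[complete], levels = c(0, 1)))
print(outcome_counts)

# Is the predictor missing in exactly the rows where the outcome is 1?
same_information <- all(is.na(var_3) == (var_1 == 1))
print(same_information)
```

If the count for 1 is zero, reversing the 0/1 coding would not be expected to help, because the predictor is unobserved in exactly the rows that define one of the two outcome classes.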
The below details a simplified version of an optimization problem that I am having trouble solving.
The objective is to minimize a cost function for an organization that delivers water via trucks and use that equation to produce a truck delivery schedule that minimizes costs.
The organization delivers water to ~10,000 household tanks throughout the year.
The tanks have a maximum capacity of 300 gallons and minimum desired limit of 100 gallons -- that is, the tanks should be topped up to 300 before they go below 100.
For example, if the tank is 115 gallons on week 2 and is estimated to use 20 gallons in week 3, it needs to be refilled in week 3.
The costs include:
A per-delivery fee of $10
The weekly cost of trucks. The weekly cost of a truck is $1,000. So if 200 deliveries are made in a single week, the cost is $3,000 (200 * 10 + 1000 * 1). If 201 deliveries are made, the cost jumps significantly to $4,010 (201 * 10 + 1000 * 2).
Water usage varies across households and across weeks. Peak water usage is in the summer. If we blindly followed the rule to refill just before hitting the 100 gallon minimum limit, then it is likely that the peak number of trucks would be higher than needed if the deliveries were spread out into the "shoulders" of the summer.
I have created estimates of weekly water usage for each week for each household. Furthermore, I have grouped like households to reduce the size of the optimization problem (~10k households down to 8 groups).
To restate the goal: The output of this optimizer should be: deliver or not, for each household group, for each of the 52 weeks in a year.
Simplified data (i.e., for 8 groups and 12 weeks):
df.usage <- structure(list(reduction.group = c(1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8), week = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12), water_usage = c(46, 50, 42, 47, 43, 39,
38, 32, 42, 36, 42, 30, 46, 50, 42, 47, 43, 39, 38, 32, 42, 36,
42, 30, 46, 50, 43, 47, 43, 39, 38, 32, 42, 36, 42, 30, 46, 50,
43, 47, 43, 39, 38, 32, 42, 36, 42, 30, 29, 32, 27, 30, 27, 25,
24, 20, 26, 23, 27, 19, 29, 32, 27, 30, 27, 25, 24, 20, 26, 23,
27, 19, 29, 32, 27, 30, 28, 25, 25, 21, 27, 23, 27, 19, 29, 32,
27, 30, 28, 25, 25, 21, 27, 23, 27, 20), tank.level.start = c(115,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 165, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 200, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, 215, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 225, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 230,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 235, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 240, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA)), row.names = c(NA, 96L), class = "data.frame")
Tank level refill rules
Here is a nested set of loops for determining tank levels over time with a "refill" logic:
library(dplyr)
reduction.groups <- unique(df.usage$reduction.group)
df.after.refill.logic <- list()
for (i in reduction.groups) {
  temp <- df.usage %>% filter(reduction.group == i)
  temp$refilled <- 0
  temp$level <- temp$tank.level.start
  n <- nrow(temp)
  if (n > 1) for (j in 2:n) {
    temp$level[j] <- ( temp$level[j-1] - temp$water_usage[j] )
    if(temp$level[j] < 100) {
      temp$level[j] <- 300
      temp$refilled[j] <- 1
    }
  }
  df.after.refill.logic <- bind_rows(df.after.refill.logic, temp)
}
Decision Variables
Delivery or not to each group, each week of the year (Binary)
Constraints
No partial trucks: number of trucks must be integers
Truck capacity: truck deliveries/week <= 200
Tanks can't go below 100 gallons: level >= 100
Delivery must be binary
Constants
1600 # truck_weekly_costs
10 # cost_per_delivery
200 # weekly_delivery_capacity_per_truck
Example Cost Function
weekly_cost_function <- function(i){
cost <- (ceiling(sum(i)/200)) * 1600 + (sum(i) * 10)
cost
}
Example cost for one week with i = 199 deliveries:
weekly_cost_function(i = 199)
[1] 3590
Attempt to Model the Problem using OMPR
Below is the beginning of a model created with the OMPR package (although using another package would be okay):
I am confused about how to set this up using the data above.
Three obvious problems:
How can I include the ceiling logic expressed in the Example Cost Function in the OMPR code?
The model below doesn't incorporate the data in the data frame above (df.usage). The goal is for an optimizer to generate values for the "refilled" and "level" variables based on the four variables (reduction.group, week, water_usage, tank.level.start), along with the constants.
The refill logic I wrote in the "determining tank levels" loop above isn't incorporated. Should that be added as a constraint? If so, how?
num_groups <- length(unique(df.usage$reduction.group))
num_weeks <- length(unique(df.usage$week))
MIPModel() %>%
add_variable(x[i,w], # create decision variable: deliver or not by...
i = 1:num_groups, # group,
w = 1:num_weeks, # in week.
type = "integer", # Integers only
lb = 0, ub = 1) %>% # between 0 and 1, inclusive
set_objective(sum_expr( x[i,w]/200 * 1600 + x[i,w] * 10,
i = 1:num_groups,
w = 1:num_weeks),
sense = "min") %>%
# add constraint to achieve ceiling(x[i,w]/200), or should this be in the set_objective call?
add_constraint(???) %>%
solve_model(with_ROI("glpk"))
Desired Output
Here is what an example head() output would look like:
reduction.group  week  water.usage  refill  level
              1     1           46       0    115
              1     2           50       1    300
              1     3           42       0    258
              1     4           47       0    211
              1     5           43       0    168
              1     6           39       0    129
Importantly, the refill values would be whatever minimizes the cost function and keeps the level above 100.
The ceiling function is a difficult non-linear function (non-differentiable, not continuous), and should be avoided at all cost. However it can be modeled quite easily with general integer variables. For non-negative variables x >= 0 we can formulate
y = ceiling(x)
as
x <= y <= x+1
y integer
This is completely linear and is trivial to implement in OMPR (or in any other LP/MIP tool).
Detail note. This formulation allows the model to choose y=x or y=x+1 in the special case where x assumes an integer value. If you want to be picky about this case, you can tighten the upper bound so that only y=x remains feasible there:
x <= y <= x+0.9999
y integer
I would not worry about this.
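To see why these bounds reproduce the ceiling, here is a small brute-force check in base R rather than a solver model. It treats the weekly truck count as the integer y and deliveries / 200 as x, and uses the constants from the question (200 deliveries per truck, 1600 per truck, 10 per delivery). The helper min_trucks and its max_trucks bound are hypothetical names for this sketch. Because every truck adds cost, a minimizing solver would pick the smallest feasible integer, so in this problem only the lower bound x <= y is doing the work.

```r
capacity <- 200        # deliveries per truck per week
truck_cost <- 1600     # weekly cost per truck
delivery_cost <- 10    # cost per delivery

# Smallest integer y with capacity * y >= deliveries, i.e. y >= x.
min_trucks <- function(deliveries, max_trucks = 10) {
  candidates <- 0:max_trucks
  feasible <- candidates[capacity * candidates >= deliveries]
  min(feasible)
}

deliveries <- c(0, 1, 199, 200, 201, 400, 401)
trucks <- vapply(deliveries, min_trucks, numeric(1))

# Compare the linear-constraint result with ceiling().
comparison <- data.frame(
  deliveries = deliveries,
  trucks = trucks,
  ceiling = ceiling(deliveries / capacity),
  weekly_cost = trucks * truck_cost + deliveries * delivery_cost
)
print(comparison)
```

In a MIP model the same idea would be an integer truck variable per week with the constraint that capacity times trucks is at least the number of deliveries that week, and the truck cost term in the objective.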
With the ceiling function, this seems like a difficult problem for a hill-climbing optimizer. I think genetic algorithms are a better fit. The matrix of deliver-or-not for each house each week makes a nice genome.
library(dplyr)
# Original given sample input data.
df.usage <- structure(list(reduction.group = c(1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8), week = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12), water_usage = c(46, 50, 42, 47, 43, 39,
38, 32, 42, 36, 42, 30, 46, 50, 42, 47, 43, 39, 38, 32, 42, 36,
42, 30, 46, 50, 43, 47, 43, 39, 38, 32, 42, 36, 42, 30, 46, 50,
43, 47, 43, 39, 38, 32, 42, 36, 42, 30, 29, 32, 27, 30, 27, 25,
24, 20, 26, 23, 27, 19, 29, 32, 27, 30, 27, 25, 24, 20, 26, 23,
27, 19, 29, 32, 27, 30, 28, 25, 25, 21, 27, 23, 27, 19, 29, 32,
27, 30, 28, 25, 25, 21, 27, 23, 27, 20), tank.level.start = c(115,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 165, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 200, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, 215, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 225, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 230,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 235, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 240, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA)), row.names = c(NA, 96L), class = "data.frame")
# Original given delivery cost function.
weekly_cost_function <- function(i){
cost <- (ceiling(sum(i)/200)) * 1600 + (sum(i) * 10)
cost
}
# Calculate the list of houses (reduction.groups) and number of delivery weeks (weeks).
reduction.groups <- unique(df.usage$reduction.group)
temp <- df.usage %>% filter(reduction.group == 1)
weeks <- nrow(temp)
# The genome consists of a matrix representing deliver-or-not to each house each week.
create_random_delivery_schedule <- function(number_of_houses, number_of_weeks, prob = NULL) {
matrix(sample(c(0, 1), number_of_houses * number_of_weeks, replace = TRUE, prob = prob), number_of_houses)
}
# Generate a population of random genes.
population_size <- 100
schedules <- replicate(population_size, create_random_delivery_schedule(length(reduction.groups), weeks), simplify = FALSE)
# Calculate fitness of an individual.
fitness <- function(schedule) {
  # Fitness is related to delivery cost.
  delivery_cost <- sum(apply(schedule, 2, weekly_cost_function))
  # If the schedule allows a tank level to drop below 100, apply a fitness penalty.
  # Don't make the fitness penalty too large.
  # If the fitness penalty is large enough to be catastrophic (essentially zero children)
  # then solutions that are close to optimal will also be likely to generate children
  # who fall off the catastrophe cliff, so there will be a selective pressure away from
  # close to optimal solutions.
  # However, if your optimizer generates a lot of infeasible solutions, raise the penalty.
  for (i in reduction.groups) {
    temp <- df.usage %>% filter(reduction.group == i)
    temp$level <- temp$tank.level.start
    if (weeks > 1) for (j in 2:weeks) {
      if (1 == schedule[i,j]) {
        temp$level[j] <- 300
      } else {
        temp$level[j] <- ( temp$level[j-1] - temp$water_usage[j] )
        if (100 > temp$level[j]) {
          # Fitness penalty.
          delivery_cost <- delivery_cost + 10 * (100 - temp$level[j])
        }
      }
    }
  }
  # Return one over delivery cost so that lower cost is higher fitness.
  1 / delivery_cost
}
# Generate a new schedule by combining two parents chosen randomly weighted by fitness.
make_baby <- function(population_fitness) {
  # Choose some parents.
  parents <- sample(length(schedules), 2, prob = population_fitness)
  # Get DNA from mommy.
  baby <- schedules[[parents[1]]]
  # Figure out what part of the DNA to get from daddy.
  house_range <- sort(sample(length(reduction.groups), 2))
  week_range <- sort(sample(weeks, 2))
  # Get DNA from daddy.
  baby[house_range[1]:house_range[2],week_range[1]:week_range[2]] <- schedules[[parents[2]]][house_range[1]:house_range[2],week_range[1]:week_range[2]]
  # Mutate, 1% chance of flipping each bit.
  changes <- create_random_delivery_schedule(length(reduction.groups), weeks, c(0.99, 0.01))
  # Return the mutated schedule as a 0/1 integer matrix.
  apply(xor(baby, changes), c(1, 2), as.integer)
}
# Run this at top level: make_baby() reads the global `schedules`.
lowest_cost <- Inf
# Loop creating and evaluating generations.
for (ii in 1:100) {
  # One numeric fitness value per schedule.
  population_fitness <- vapply(schedules, fitness, numeric(1))
  lowest_cost_this_generation <- 1 / max(population_fitness)
  print(sprintf("lowest cost = %f", lowest_cost_this_generation))
  if (lowest_cost_this_generation < lowest_cost) {
    lowest_cost <- lowest_cost_this_generation
    best_baby <- schedules[[which.max(population_fitness)]]
  }
  schedules <- replicate(population_size, make_baby(population_fitness), simplify = FALSE)
}
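To turn one row of a schedule (for example a row of best_baby) into the week-by-week table the question asks for, the same level rule used in fitness() can be replayed on its own. This is a sketch in base R. simulate_levels is a hypothetical helper, and like the code above it ignores any delivery in week 1 and sets the level to 300 in a delivery week.

```r
# Replay tank levels for one group given a 0/1 delivery vector.
simulate_levels <- function(usage, start_level, deliver) {
  level <- numeric(length(usage))
  level[1] <- start_level
  if (length(usage) > 1) for (j in 2:length(usage)) {
    # Delivery weeks reset the tank; other weeks subtract that week's usage.
    level[j] <- if (deliver[j] == 1) 300 else level[j - 1] - usage[j]
  }
  data.frame(
    week = seq_along(usage),
    water_usage = usage,
    refill = deliver,
    level = level
  )
}

# First six weeks of group 1 from the question, refilled in week 2.
usage <- c(46, 50, 42, 47, 43, 39)
result <- simulate_levels(usage, start_level = 115, deliver = c(0, 1, 0, 0, 0, 0))
print(result)
```

Checking all(result$level >= 100) for every group is a quick way to confirm that the schedule the optimizer returned is actually feasible and not just a low-penalty one.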
In the code below I'm trying to train two different auto.arima models in parallel on different cores. I get the error below when I run the code. I'm not sure whether my issue is with do.call or parLapply. I'm also pretty new to parallel processing, so any tips are very helpful.
Code:
library("forecast")
library("parallel")
TList2<-list(x=tsd1, lambda = Tlambda, stepwise=TRUE, approximation = TRUE)
DList2<-list(x=tsd2, lambda = Rlambda, stepwise=TRUE, approximation = TRUE)
##Parallelizing ARIMA Model Training
# Calculate the number of cores
no_cores <- 1
# Initiate cluster
cl <- makeCluster(no_cores)
ARIMA_List<-list(TList2,DList2)
ARIMA_Models<-parLapply(cl, ARIMA_List,
function(x){do.call(auto.arima, args=x)})
stopCluster(cl)
Error:
Error in checkForRemoteErrors(val) :
one node produced an error: object 'auto.arima' not found
Data:
dput(TList2)
structure(list(x = c(6, 15.5, 22, 16, NA, NA, 13, 13.5, 10, 6,
14.5, 16, NA, 8, 11, NA, 2, 2, 10, NA, 9, NA, 11, 16, NA, 4,
17, 7, 11.5, 22, 20.5, 10, 22, NA, 13, 17, 22, 9, 13, 19, 8,
16, 18, 22, 21, 14, 7, 20, 21.5, 17), lambda = 0.999958829041611,
stepwise = TRUE, approximation = TRUE), .Names = c("x", "lambda",
"stepwise", "approximation"))
dput(DList2)
structure(list(x = c(11, 4, 8, 11, 11, NA, 3, 2.5, 6, 11, 7,
1, NA, 6, 6, NA, 6, 11, 3, NA, 11, NA, 10, 10, NA, NA, 9, 3,
3, 11, 8, 10, NA, NA, 11, 10, 9, 3, 7, NA, 2, 4, 11, 2.5, 3,
NA, 4, 7, 1, 5), lambda = 0.170065851742339, stepwise = TRUE,
approximation = TRUE), .Names = c("x", "lambda", "stepwise",
"approximation"))
I think forecast::auto.arima has to be available on the cluster workers too, so try loading the package there, for example with clusterEvalQ like this:
TList2 <- structure(list(x = c(6, 15.5, 22, 16, NA, NA, 13, 13.5, 10, 6,
14.5, 16, NA, 8, 11, NA, 2, 2, 10, NA, 9, NA, 11, 16, NA, 4,
17, 7, 11.5, 22, 20.5, 10, 22, NA, 13, 17, 22, 9, 13, 19, 8,
16, 18, 22, 21, 14, 7, 20, 21.5, 17), lambda = 0.999958829041611,
stepwise = TRUE, approximation = TRUE), .Names = c("x", "lambda",
"stepwise", "approximation"))
DList2<- structure(list(x = c(11, 4, 8, 11, 11, NA, 3, 2.5, 6, 11, 7,
1, NA, 6, 6, NA, 6, 11, 3, NA, 11, NA, 10, 10, NA, NA, 9, 3,
3, 11, 8, 10, NA, NA, 11, 10, 9, 3, 7, NA, 2, 4, 11, 2.5, 3,
NA, 4, 7, 1, 5), lambda = 0.170065851742339, stepwise = TRUE,
approximation = TRUE), .Names = c("x", "lambda", "stepwise",
"approximation"))
library("forecast")
library("parallel")
no_cores <- 1 # as in the question
cl <- makeCluster(no_cores)
clusterEvalQ(cl, library(forecast))
ARIMA_List<-list(TList2,DList2)
ARIMA_Models<-parLapply(cl, ARIMA_List,
function(x){do.call(auto.arima, args=x)})
stopCluster(cl)
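The mechanism can be reproduced without forecast installed. The sketch below uses splines::bs as a stand-in for auto.arima, on the assumption that splines ships with R but is not attached in a fresh worker session. The names arg_sets, before and after are illustrative. The worker starts as a new R process, so packages attached in the main session are not attached there until clusterEvalQ loads them.

```r
library(parallel)

cl <- makeCluster(1)

# Two argument lists, analogous to TList2 and DList2.
arg_sets <- list(list(x = 1:10, df = 4), list(x = 1:20, df = 5))

# Before loading the package on the worker, `bs` cannot be found there.
before <- tryCatch(
  parLapply(cl, arg_sets, function(a) do.call(bs, a)),
  error = function(e) conditionMessage(e)
)
print(before)

# Attach the package on every worker, then run the same call again.
clusterEvalQ(cl, library(splines))
after <- parLapply(cl, arg_sets, function(a) do.call(bs, a))

stopCluster(cl)

print(vapply(after, ncol, numeric(1)))
```

Another option, assuming the package is installed wherever the workers run, is to write the namespace-qualified name (here splines::bs, in the question forecast::auto.arima) inside the worker function, which avoids the attach step.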