Optimization in R: cost function with binary scheduling variables? - r

Below is a simplified version of an optimization problem that I am having trouble solving.
The objective is to minimize the cost for an organization that delivers water via trucks, and to use that cost function to produce a truck delivery schedule that minimizes costs.
The organization delivers water to ~10,000 household tanks throughout the year.
The tanks have a maximum capacity of 300 gallons and minimum desired limit of 100 gallons -- that is, the tanks should be topped up to 300 before they go below 100.
For example, if a tank is at 115 gallons in week 2 and is estimated to use 20 gallons in week 3, it needs to be refilled in week 3 (115 - 20 = 95 < 100).
The costs include:
A per-delivery fee of $10.
The weekly cost of trucks: $1,600 per truck per week (matching the constants and cost function below). So if 200 deliveries are made in a single week, the cost is $3,600 (200 * 10 + 1600 * 1). If 201 deliveries are made, the cost jumps significantly to $5,210 (201 * 10 + 1600 * 2), because a second truck is needed.
Water usage varies across households and across weeks. Peak water usage is in the summer. If we blindly followed the rule to refill just before hitting the 100 gallon minimum, the peak number of trucks would likely be higher than necessary; spreading some deliveries into the "shoulders" of the summer could reduce it.
I have created estimates of weekly water usage for each week for each household. Furthermore, I have grouped like households to reduce the size of the optimization problem (~10k households down to 8 groups).
To restate the goal: the optimizer's output should be a deliver-or-not decision for each household group, for each of the 52 weeks in the year.
Simplified data (i.e., for 8 groups and 12 weeks):
df.usage <- structure(list(reduction.group = c(1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8), week = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12), water_usage = c(46, 50, 42, 47, 43, 39,
38, 32, 42, 36, 42, 30, 46, 50, 42, 47, 43, 39, 38, 32, 42, 36,
42, 30, 46, 50, 43, 47, 43, 39, 38, 32, 42, 36, 42, 30, 46, 50,
43, 47, 43, 39, 38, 32, 42, 36, 42, 30, 29, 32, 27, 30, 27, 25,
24, 20, 26, 23, 27, 19, 29, 32, 27, 30, 27, 25, 24, 20, 26, 23,
27, 19, 29, 32, 27, 30, 28, 25, 25, 21, 27, 23, 27, 19, 29, 32,
27, 30, 28, 25, 25, 21, 27, 23, 27, 20), tank.level.start = c(115,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 165, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 200, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, 215, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 225, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 230,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 235, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 240, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA)), row.names = c(NA, 96L), class = "data.frame")
Tank level refill rules
Here is a nested set of loops for determining tank levels over time with a "refill" logic:
library(dplyr)
reduction.groups <- unique(df.usage$reduction.group)
df.after.refill.logic <- list()
for (i in reduction.groups) {
  temp <- df.usage %>% filter(reduction.group == i)
  temp$refilled <- 0
  temp$level <- temp$tank.level.start
  n <- nrow(temp)
  if (n > 1) for (j in 2:n) {
    # Draw the tank down by this week's usage...
    temp$level[j] <- temp$level[j - 1] - temp$water_usage[j]
    # ...and refill to capacity if it would drop below the minimum.
    if (temp$level[j] < 100) {
      temp$level[j] <- 300
      temp$refilled[j] <- 1
    }
  }
  df.after.refill.logic <- bind_rows(df.after.refill.logic, temp)
}
Decision Variables
Delivery or not to each group, each week of the year (Binary)
Constraints
No partial trucks: the number of trucks must be an integer
Truck capacity: truck deliveries/week <= 200
Tanks can't go below 100 gallons: level >= 100
Delivery must be binary
Constants
1600 # truck_weekly_costs
10 # cost_per_delivery
200 # weekly_delivery_capacity_per_truck
Example Cost Function
weekly_cost_function <- function(i) {
  cost <- ceiling(sum(i) / 200) * 1600 + sum(i) * 10
  cost
}
Example cost for one week with i = 199 deliveries:
weekly_cost_function(i = 199)
[1] 3590
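For comparison, here is the jump in cost once a second truck is needed, computed with the same function:
weekly_cost_function(i = 200)
[1] 3600
weekly_cost_function(i = 201)
[1] 5210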
Attempt to Model the Problem using OMPR
Below is the beginning of a model created with the OMPR package (although using another package would be okay):
I am confused about how to set this up using the data above.
Three obvious problems:
How can I include the ceiling logic expressed in the Example Cost Function in the OMPR code?
The model below isn't incorporating the data in the data frame above (df.usage). The goal is for the optimizer to generate values for the "refilled" and "level" variables based on the four variables (reduction.group, week, water_usage, tank.level.start), along with the constants.
The refill logic I wrote in the "determining tank levels" loop above isn't incorporated. Should that be added as a constraint? If so, how?
num_groups <- length(unique(df.usage$reduction.group))
num_weeks <- length(unique(df.usage$week))

MIPModel() %>%
  add_variable(x[i, w],           # create decision variable: deliver or not, by...
               i = 1:num_groups,  # ...group,
               w = 1:num_weeks,   # ...in week.
               type = "integer",  # integers only,
               lb = 0, ub = 1) %>% # between 0 and 1, inclusive
  set_objective(sum_expr(x[i, w] / 200 * 1600 + x[i, w] * 10,
                         i = 1:num_groups,
                         w = 1:num_weeks),
                sense = "min") %>%
  # add constraint to achieve ceiling(x[i,w]/200), or should this be in the set_objective call?
  add_constraint(???) %>%
  solve_model(with_ROI("glpk"))
Desired Output
Here is what an example head() output would look like:
reduction.group week water.usage refill level
1 1 46 0 115
1 2 50 1 300
1 3 42 0 258
1 4 47 0 211
1 5 43 0 168
1 6 39 0 129
Importantly, the refill values should be whatever minimizes the cost function while keeping the level at or above 100.

The ceiling function is a difficult non-linear function (non-differentiable, not continuous) and should be avoided at all cost. However, it can be modeled quite easily with general integer variables. For a non-negative quantity x >= 0 we can formulate
y = ceiling(x)
as
x <= y <= x+1
y integer
This is completely linear and is trivial to implement in OMPR (or in any other LP/MIP tool).
Detail note. This formulation will allow the model to choose y=x or y=x+1 in the special case where x assumes an integer value. If you want to be picky about this case, you can do:
x+0.0001 <= y <= x+1
y integer
I would not worry about this.
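To make this concrete, here is a minimal sketch of how the linearization could look in OMPR, using the constants from the question and treating each group as a single delivery, as in the question's draft model (the tank-level constraints would still need to be added; y[w] and the other names are illustrative). Because the objective minimizes the truck cost, the solver pushes y[w] down, so the capacity side of the constraint alone makes y[w] behave like the ceiling:

library(ompr)
library(ompr.roi)
library(ROI.plugin.glpk)

num_groups <- length(unique(df.usage$reduction.group))
num_weeks <- length(unique(df.usage$week))

MIPModel() %>%
  # deliver or not: one binary decision per group per week
  add_variable(x[i, w], i = 1:num_groups, w = 1:num_weeks, type = "binary") %>%
  # y[w] plays the role of ceiling(deliveries in week w / 200) = trucks needed
  add_variable(y[w], w = 1:num_weeks, type = "integer", lb = 0) %>%
  # capacity link: deliveries in week w must fit into y[w] trucks
  add_constraint(sum_expr(x[i, w], i = 1:num_groups) <= 200 * y[w], w = 1:num_weeks) %>%
  set_objective(sum_expr(1600 * y[w], w = 1:num_weeks) +
                  sum_expr(10 * x[i, w], i = 1:num_groups, w = 1:num_weeks),
                sense = "min") %>%
  solve_model(with_ROI("glpk"))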

With the ceiling function, this seems like a difficult problem for a hill-climbing optimizer. I think genetic algorithms are a better fit. The matrix of deliver-or-not for each house each week makes a nice genome.
library(dplyr)
# Original given sample input data.
df.usage <- structure(list(reduction.group = c(1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8), week = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9,
10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4,
5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6,
7, 8, 9, 10, 11, 12), water_usage = c(46, 50, 42, 47, 43, 39,
38, 32, 42, 36, 42, 30, 46, 50, 42, 47, 43, 39, 38, 32, 42, 36,
42, 30, 46, 50, 43, 47, 43, 39, 38, 32, 42, 36, 42, 30, 46, 50,
43, 47, 43, 39, 38, 32, 42, 36, 42, 30, 29, 32, 27, 30, 27, 25,
24, 20, 26, 23, 27, 19, 29, 32, 27, 30, 27, 25, 24, 20, 26, 23,
27, 19, 29, 32, 27, 30, 28, 25, 25, 21, 27, 23, 27, 19, 29, 32,
27, 30, 28, 25, 25, 21, 27, 23, 27, 20), tank.level.start = c(115,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 165, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 200, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, 215, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 225, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 230,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 235, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 240, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA)), row.names = c(NA, 96L), class = "data.frame")
# Original given delivery cost function.
weekly_cost_function <- function(i) {
  cost <- ceiling(sum(i) / 200) * 1600 + sum(i) * 10
  cost
}
# Calculate the list of houses (reduction.groups) and number of delivery weeks (weeks).
reduction.groups <- unique(df.usage$reduction.group)
temp <- df.usage %>% filter(reduction.group == 1)
weeks <- nrow(temp)
# The genome consists of a matrix representing deliver-or-not to each house each week.
create_random_delivery_schedule <- function(number_of_houses, number_of_weeks, prob = NULL) {
  matrix(sample(c(0, 1), number_of_houses * number_of_weeks, replace = TRUE, prob = prob),
         number_of_houses)
}
# Generate a population of random genomes.
population_size <- 100
schedules <- replicate(population_size,
                       create_random_delivery_schedule(length(reduction.groups), weeks),
                       simplify = FALSE)
# Calculate fitness of an individual.
fitness <- function(schedule) {
  # Fitness is related to delivery cost.
  delivery_cost <- sum(apply(schedule, 2, weekly_cost_function))
  # If the schedule allows a tank level to drop below 100, apply a fitness penalty.
  # Don't make the fitness penalty too large.
  # If the fitness penalty is large enough to be catastrophic (essentially zero children),
  # then solutions that are close to optimal will also be likely to generate children
  # who fall off the catastrophe cliff, so there will be selective pressure away from
  # close-to-optimal solutions.
  # However, if your optimizer generates a lot of infeasible solutions, raise the penalty.
  for (i in reduction.groups) {
    temp <- df.usage %>% filter(reduction.group == i)
    temp$level <- temp$tank.level.start
    if (weeks > 1) for (j in 2:weeks) {
      if (1 == schedule[i, j]) {
        temp$level[j] <- 300
      } else {
        temp$level[j] <- temp$level[j - 1] - temp$water_usage[j]
        if (100 > temp$level[j]) {
          # Fitness penalty.
          delivery_cost <- delivery_cost + 10 * (100 - temp$level[j])
        }
      }
    }
  }
  # Return one over delivery cost so that lower cost is higher fitness.
  1 / delivery_cost
}
# Generate a new schedule by combining two parents chosen randomly, weighted by fitness.
make_baby <- function(population_fitness) {
  # Choose some parents.
  parents <- sample(length(schedules), 2, prob = population_fitness)
  # Get DNA from mommy.
  baby <- schedules[[parents[1]]]
  # Figure out what part of the DNA to get from daddy.
  house_range <- sort(sample(length(reduction.groups), 2))
  week_range <- sort(sample(weeks, 2))
  # Get DNA from daddy.
  baby[house_range[1]:house_range[2], week_range[1]:week_range[2]] <-
    schedules[[parents[2]]][house_range[1]:house_range[2], week_range[1]:week_range[2]]
  # Mutate: 1% chance of flipping each bit.
  changes <- create_random_delivery_schedule(length(reduction.groups), weeks, c(0.99, 0.01))
  baby <- apply(xor(baby, changes), c(1, 2), as.integer)
  baby
}
lowest_cost <- Inf
# Loop creating and evaluating generations.
for (ii in 1:100) {
  population_fitness <- lapply(schedules, fitness)
  lowest_cost_this_generation <- 1 / max(unlist(population_fitness))
  print(sprintf("lowest cost = %f", lowest_cost_this_generation))
  if (lowest_cost_this_generation < lowest_cost) {
    lowest_cost <- lowest_cost_this_generation
    best_baby <- schedules[[which.max(unlist(population_fitness))]]
  }
  schedules <- replicate(population_size, make_baby(population_fitness), simplify = FALSE)
}
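Once the loop finishes, lowest_cost holds the best cost found and best_baby the corresponding deliver-or-not matrix, so the winning schedule can be inspected directly:
# Best cost and schedule found across all generations.
lowest_cost
best_baby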

Related

Reverse score a subset of items based on variable predefined maximums

I have longitudinal data for which I would like to reverse score a subset of items using corresponding predefined maximum scores that are stored in a separate data frame.
In the example data below (df) there are three scores, DST, SOS, and VR, at two timepoints (baseline and wave 1). neg_skew.vars contains the scores that are to be reversed across timepoints. I would like to reverse scores based on the maximum possible value for that score, as stored in df.CP1.vars$max.vars. I'd like this to work when multiple scores with different maximum values are included in neg_skew.vars.
For example, "SOS.Score" is stored in neg_skew.vars, so I want all SOS.Score variables to be reversed across timepoints; in the example data below this includes SOS.Score.baseline and SOS.Score.wave1. I want scores to be reversed using the corresponding maximum score for SOS: each value should become (20 + 1) - value, where 20 is the maximum value for SOS stored in df.CP1.vars. As DST is also negatively skewed, all DST scores (DST.Score.baseline and DST.Score.wave1) should be reversed too, but with 16 as the maximum value, per df.CP1.vars, so: (16 + 1) - value. This results in the desired data frame df_wanted below. VR.Score does not appear in neg_skew.vars, so no VR.Score variables (VR.Score.baseline, VR.Score.wave1) are reversed.
So far I have the code listed below under # reverse scores, but it produces two undesired outcomes in the resulting data frame (df2): A) the columns for scores that are not reversed, such as VR, are not retained, and B) the maximum value used to reverse items is the observed maximum for that item at that timepoint, which is a problem because the data are longitudinal.
The desired data should look like df_wanted. I tried to set up a for loop but ran into problems combining it with the dplyr pipeline.
# required packages
library(dplyr)
# create relevant variables and data sets
CP1.vars <- c("DST.Score","SOS.Score", "VR.Score")
max.vars <- c(16,20,80)
df.CP1.vars <- data.frame(CP1.vars, max.vars)
df <- structure(list(
SOS.Score.baseline = c(4, 11, 7, 9, 10, 8, 6, 8, 7, 0, 9, 10),
SOS.Score.wave1 = c(NA, 7.5, 8.5, NA, NA, 6.66, NA, 6, 8, 8, 7, 8),
DST.Score.baseline = c(11, 10, 8, 8, 8, 8, 9, 9, 7, 6, 7, 6),
DST.Score.wave1 = c(NA, 10, 8.5, NA, NA, 8, NA, 9.33, 9, 7, 8, 8),
VR.Score.baseline = c(NA, 60, 38.5, 50, NA, 48, NA, 33, 49, 67, 78, 80),
VR.Score.wave1 = c(NA, 58, 38.5, NA, NA, 40, NA, 35, 49, 67, 78, 78)),
row.names = c(NA, 12L), class = "data.frame")
neg_skew.vars <- c("SOS.Score", "DST.Score")
# reverse scores
df2 <- df %>%
  select(contains(neg_skew.vars)) %>%
  mutate(across(everything(), ~ max(., na.rm = TRUE) + 1 - ., .names = "{.col}_r"))
# desired outcome (order of variables irrelevant)
df_wanted <- structure(list(
SOS.Score.baseline = c(4, 11, 7, 9, 10, 8, 6, 8, 7, 0, 9, 10),
SOS.Score.wave1 = c(NA, 7.5, 8.5, NA, NA, 6.66, NA, 6, 8, 8, 7, 8),
SOS.Score.baseline_r = c(17, 10, 14, 12, 11, 13, 15, 13, 14, 21, 12, 11),
SOS.Score.wave1_r = c(NA, 13.5, 12.5, NA, NA, 14.34, NA, 15, 13, 13, 14, 13),
DST.Score.baseline = c(11, 10, 8, 8, 8, 8, 9, 9, 7, 6, 7, 6),
DST.Score.wave1 = c(NA, 10, 8.5, NA, NA, 8, NA, 9.33, 9, 7, 8, 8),
DST.Score.baseline_r = c(6, 7, 9, 9, 9, 9, 8, 8, 10, 11, 10, 11),
DST.Score.wave1_r = c(NA, 7, 8.5, NA, NA, 9, NA, 7.67, 8, 10, 9, 9),
VR.Score.baseline = c(NA, 60, 38.5, 50, NA, 48, NA, 33, 49, 67, 78, 80),
VR.Score.wave1 = c(NA, 58, 38.5, NA, NA, 40, NA, 35, 49, 67, 78, 78)),
row.names = c(NA,12L), class = "data.frame")
You can use purrr::map_dfc to loop over neg_skew.vars, look the maximum value up directly in df.CP1.vars, and then bind the resulting data frame to the columns that remain unchanged.
library(tidyverse)
library(purrr)

df2 <- neg_skew.vars %>%
  map_dfc(function(a) df %>%
            select(matches(a)) %>%
            mutate(across(everything(),
                          ~ df.CP1.vars$max.vars[df.CP1.vars$CP1.vars == a] + 1 - .,
                          .names = "{.col}_r"))) %>%
  bind_cols(df %>% select(!contains(neg_skew.vars)))
This indeed leads to the desired outcome:
identical(df2, df_wanted)
#[1] TRUE
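For comparison, here is the same idea in base R, a sketch assuming (as in the data here) that each column name begins with the score name:
# For each negatively skewed score, reverse all matching columns
# using that score's predefined maximum.
for (v in neg_skew.vars) {
  max_v <- df.CP1.vars$max.vars[df.CP1.vars$CP1.vars == v]
  cols <- grep(v, names(df), value = TRUE, fixed = TRUE)
  df[paste0(cols, "_r")] <- (max_v + 1) - df[cols]
}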
Data:
# create relevant variables and data sets
CP1.vars <- c("DST.Score","SOS.Score", "VR.Score")
max.vars <- c(16,20,80)
df.CP1.vars <- data.frame(CP1.vars, max.vars)
df <- structure(list(
SOS.Score.baseline = c(4, 11, 7, 9, 10, 8, 6, 8, 7, 0, 9, 10),
SOS.Score.wave1 = c(NA, 7.5, 8.5, NA, NA, 6.66, NA, 6, 8, 8, 7, 8),
DST.Score.baseline = c(11, 10, 8, 8, 8, 8, 9, 9, 7, 6, 7, 6),
DST.Score.wave1 = c(NA, 10, 8.5, NA, NA, 8, NA, 9.33, 9, 7, 8, 8),
VR.Score.baseline = c(NA, 60, 38.5, 50, NA, 48, NA, 33, 49, 67, 78, 80),
VR.Score.wave1 = c(NA, 58, 38.5, NA, NA, 40, NA, 35, 49, 67, 78, 78)),
row.names = c(NA, 12L), class = "data.frame")
neg_skew.vars <- c("SOS.Score", "DST.Score")

R MICE Impute missing data conditionally on group allocation

I am using mice in R to impute missing data for an intervention study. The data I want to impute are continuous. I have two observation groups: 1 = the intervention group, 2 = the control group.
There are some variables that I only want mice to impute for the intervention group. For the control group, the NAs should remain.
Here is a toy version of my data set:
toy2 <- data.frame(group= c(1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2),
PSS_t1=c(16, 15, 17, 22, 30, 12, 18, 19, 20, 14, 21, 23, 27, 20, 14),
CSQ_t2= c(8, 9, NA, 12, NA, 7, NA, 9, NA, NA, 10, 11, NA, 6, NA),
PSS_t3=c(15, 14, 15, 19, 22, 7, 19, 21, 14, 18, 12, 10, 22, 17, 14),
PSS_t4=c(1, 1, NA, 1, NA, 1, NA, 1, NA, NA, 1, 1, NA, 1, NA))
The scales CSQ_t2 and PSS_t4 were only presented to the intervention group, so NAs for the control group are natural.
I thought about using post from mice, but unfortunately it does not work:
posttoy2 <- mice(toy2, m = 1, maxit = 0)$post
posttoy2["CSQ_t2"] <- "ifdo(group==1)"
posttoy2["PSS_t4"] <- "ifdo(group==1)"
imptoy2 <- mice(toy2, m = 1, post = posttoy2, seed = 555, print = TRUE)
Here is the output I get:
> toy2 <- data.frame(group= c(1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2),
+ PSS_t1=c(16, 15, 17, 22, 30, 12, 18, 19, 20, 14, 21, 23, 27, 20, 14),
+ CSQ_t2= c(8, 9, NA, 12, NA, 7, NA, 9, NA, NA, 10, 11, NA, 6, NA),
+ PSS_t3=c(15, 14, 15, 19, 22, 7, 19, 21, 14, 18, 12, 10, 22, 17, 14),
+ PSS_t4=c(1, 1, NA, 1, NA, 1, NA, 1, NA, NA, 1, 1, NA, 1, NA))
> posttoy2 <- mice(toy2, m = 1, maxit = 0)$post
Warning message:
Number of logged events: 1
> posttoy2["CSQ_t2"] <- "ifdo(group_unguided==1)"
> posttoy2["CSQ_t2"] <- "ifdo(group==1)"
> posttoy2["PSS_t4"] <- "ifdo(group==1)"
> posttoy2
group PSS_t1 CSQ_t2 PSS_t3 PSS_t4
"" "" "ifdo(group==1)" "" "ifdo(group==1)"
> imptoy2 <- mice(toy2, m = 1, post = posttoy2, seed = 555,
+ print = TRUE)
iter imp variable
1 1 CSQ_t2Function ifdo() not yet implemented.
2 1 CSQ_t2Function ifdo() not yet implemented.
3 1 CSQ_t2Function ifdo() not yet implemented.
4 1 CSQ_t2Function ifdo() not yet implemented.
5 1 CSQ_t2Function ifdo() not yet implemented.
Warning message:
Number of logged events: 6
Does anyone have an idea how I can realize this conditional imputation?
Many thanks in advance for your ideas and help! :-)
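One possible workaround, since ifdo() is not implemented: a minimal sketch that imputes as usual and then restores the structural NAs for the control group afterwards. Note that the imputation models for the other variables will have seen the temporarily imputed control values, so check whether that is acceptable for your study:
library(mice)
# Impute everything, including the control group's structural NAs...
imptoy2 <- mice(toy2, m = 1, seed = 555, print = FALSE)
completed <- complete(imptoy2)
# ...then blank out the variables that only apply to the intervention group.
completed$CSQ_t2[completed$group == 2] <- NA
completed$PSS_t4[completed$group == 2] <- NA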

How to get svydb R package for large survey data sets to return standard errors

Excited to see that Charco Hui has resurrected Thomas Lumley's experimental sqlsurvey package as svydb. Potentially a great tool for working with large survey data sets in R.
I have, though, run into some problems with svydb calculating standard errors. This issue is very similar to one I encountered with sqlsurvey. The analogous procedures work fine in the regular survey package. I am working on a Mac Pro running Mojave, with R version 3.5.1 and the MonetDB Aug 2018-SP1 release.
The package is potentially a very important tool for health researchers and epidemiologists, so I am hoping I am doing something obviously wrong that will be easy to sort out.
Below is a reproducible example based on some random draws from a BRFSS data set.
# create data
dat<-data.frame(var1=c(6, 5, 6, 6, 6, 6, 3, 3, 2, 6, 3, 4, 6, 6, 3, 6, 5, 5, 4, 5, 4, 4, 6, 6, 4, 3, 4, 4, 3, 5, 5, 3, 6, 1, 6, 3, 5, 3, 4, 2, 5, 2, 6, 5, 6, 2, 2, 4, 3, 6, 6, 6, 3, 6, 2, 3, 4, 1, 4, 4, 2, 6, 5, 4, 6, 2, 6, 2, 6, 6, 6, 5, 4, 2, 5, 3, 4, 4, 5, 1, 2, 6, 6, 6, 2, 6, 4, 4, 2, 3, 6, 3, 3, 6, 2, 6, 5, 4, 6, 2),
var2=c(NA, NA, 88, 88, 88, NA, 1, 88, 88, NA, NA, 2, NA, NA, 88, NA, NA, 88, NA, 88, 88, 88, 88, NA, 88, 88, 88, NA, 88, NA, NA, 88, NA, NA, NA, 88, NA, NA, 4, NA, 88, NA, NA, 8, NA, NA, NA, 88, 88, NA, NA, NA, NA, NA, 88, NA, 88, 88, NA, 88, NA, 88, NA, 11, NA, 1, NA, 4, 88, 7, NA, 88, NA, NA, 88, 88, NA, 88, NA, 88, 1, 88, 88, NA, 88, NA, NA, 88, 29, 88, 88, NA, 88, NA, 88, NA, 88, NA, NA, 88),
strat=c(25011, 12032, 19101, 22011, 24011, 53311, 8011, 25061, 8012, 35031, 32022, 18022, 34101, 19061, 5011, 24011, 34101, 24011, 19021, 31031, 37111, 5012, 25011, 53201, 15052, 22011, 37261, 27011, 30031, 44031, 26011, 42011, 37121, 25041, 39082, 24032, 38011, 15012, 24012, 35062, 16071, 42042, 27011, 45062, 50011, 25041, 56012, 25041, 9022, 55031, 55021, 19091, 35022, 28011, 34101, 40021, 9021, 23031, 34041, 78021, 2021, 30021, 1011, 9012, 40032, 18042, 20011, 49041, 24011, 53111, 16012, 20011, 22011, 25042, 49122, 53352, 53091, 9021, 19071, 29021, 18021, 37131, 46041, 8011, 42031, 47121, 46051, 17011, 42021, 72061, 34101, 25062, 35062, 37061, 55062, 46031, 45041, 28011, 37211, 12021),
psu=c(2006092024, 2006018204, 2006024879, 2006009844, 2006054713, 2006074840, 2006014252, 2006022459, 2006030518, 2006033744, 2006004188, 2006029456, 2006028186, 2006021413, 2006049813, 2006007831, 2006166137, 2006030867, 2006027180, 2006027225, 2006029188, 2006004573, 2006000994, 2006064527, 2006011204, 2006007744, 2006016792, 2006020891, 2006039237, 2006007268, 2006019976, 2006065941, 2006074498, 2006008616, 2006018805, 2006046669, 2006018600, 2006013675, 2006075607, 2006008605, 2006020846, 2006022428, 2006033687, 2006020260, 2006035555, 2006009957, 2006006278, 2006103150, 2006048853, 2006015663, 2006001191, 2006023333, 2006002411, 2006035682, 2006041137, 2006011947, 2006042893, 2006025836, 2006099337, 2006036076, 2006016300, 2006028942, 2006013850, 2006064497, 2006026877, 2006020508, 2006020272, 2006023852, 2006012831, 2006051597, 2006033700, 2006044908, 2006003444, 2006072406, 2006021862, 2006081285, 2006133751, 2006019507, 2006031458, 2006019504, 2006002030, 2006075571, 2006020267, 2006040619, 2006125145, 2006008496, 2006051043, 2006031048, 2006106217, 2006004721, 2006148449, 2006017795, 2006008827, 2006003243, 2006009762, 2006044530, 2006029068, 2006019002, 2006013326, 2006015464),
wt= c(131.167091, 1135.222016, 462.911082, 237.434588, 222.090249, 8.867523, 367.057462, 635.047502, 734.512583, 168.266313, 237.236438, 794.324159, 896.016179, 453.886381, 516.676601, 222.090249, 370.880284, 266.860031, 178.231943, 127.053275, 286.216069, 212.409224, 196.986648, 52.223519, 13.850239, 176.044166, 290.372567, 517.028007, 79.796161, 174.701770, 829.008988, 2514.492945, 177.300719, 190.845451, 270.146986, 261.739684, 45.144276, 276.447180, 1704.022745, 102.473540, 168.798958, 42.614573, 321.460974, 139.006541, 22.110006, 50.144119, 60.941799, 42.225301, 704.172192, 1094.072745, 423.243864, 424.056478, 256.796474, 222.921805, 940.041214, 191.560779, 280.303505, 959.310457, 401.479694, 27.638152, 84.411858, 111.955653, 661.796967, 72.884007, 118.783933, 1905.071998, 214.932923, 539.094181, 415.230283, 91.327212, 217.747861, 366.950758, 378.620855, 150.430028, 594.726292, 189.966155, 74.510093, 280.303505, 474.297391, 5389.787617, 753.892676, 502.921582, 59.307656, 444.877512, 51.007590, 446.469588, 32.298617, 2518.461812, 76.319144, 819.303078, 396.857545, 476.354501, 97.705354, 168.923737, 1179.296136, 34.909445, 275.003257, 410.288302, 147.335207, 723.221948))
# analysis in survey works
library(survey)
options(survey.lonely.psu = "remove")
svyDat1 <- svydesign(
  id = ~psu,
  strata = ~strat,
  data = dat,
  weights = ~wt
)
svymean(~var1, svyDat1, se = TRUE, na.rm = TRUE)
# mean SE
# numVar 79.286 0.3439
svymean(~var2, svyDat1, se = TRUE, na.rm = TRUE)
# mean SE
# var2 70.171 1.0314
# analysis in svydb does not work
library(devtools)
install_github("chrk623/svydb")
library(svydb)
svyDat2 <- svydbdesign(st = "strat", wt = "wt", id = "psu", data = dat)
svydbmean(x = var1, design = svyDat2, num = TRUE)
# Mean SE
# var1 3.6063 NA
svydbmean(x = var2, design = svyDat2, num = TRUE)
# Mean SE
# var2 70.171 NA
# try with a database connection; returns a more informative (?) error message
require(MonetDBLite)
require(DBI)
require(dbplyr)
con <- dbConnect(MonetDBLite())
dbWriteTable(con, "dat", dat)
dat.db <- tbl(con, "dat")
svyDat3 <- svydbdesign(st = strat, wt = wt, id = psu, data = dat.db)
svydbmean(x = var1, design = svyDat3, num = TRUE)
# Error in .local(conn, statement, ...) :
# Unable to execute statement 'CREATE TEMPORARY TABLE "auugiyqyip" AS SELECT "sum(scaled)"
# FROM (SELECT SUM("scaled") AS "sum(scale...'.
# Server says 'MALException:batcalc./:22012!division by zero. '.
# In addition: There were 12 warnings (use warnings() to see them)
svydbmean(x = var2, design = svyDat3, num = TRUE)
# Error in .local(conn, statement, ...) :
# Unable to execute statement 'CREATE TEMPORARY TABLE "yxqvidpdbx" AS SELECT "sum(scaled)"
# FROM (SELECT SUM("scaled") AS "sum(scale...'.
# Server says 'MALException:batcalc./:22012!division by zero. '.
As noted by Charco Hui, the author of the package, svydb has since been updated to allow svydb.lonely.psu = "remove" as an option. This solved the problem.
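Presumably (an assumption on my part, by analogy with how the survey package exposes its lonely-PSU setting) the option is set before estimating:
# Assumption: set analogously to survey's options(survey.lonely.psu = "remove").
options(svydb.lonely.psu = "remove")
svydbmean(x = var1, design = svyDat2, num = TRUE)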

Select rows with complete.cases in more than one column in R [duplicate]

This question already has answers here:
Remove rows with all or some NAs (missing values) in data.frame
(18 answers)
Closed 4 years ago.
Let's start with some data to make the example reproducible:
x <- structure(list(DC1 = c(5, 5, NA, 5, 4, 6, 5, NA, 4, 6, 6, 6,
5, NA, 5, 5, 7), DC2 = c(4, 7, 4, 5, NA, 4, 6, 4, 4, 5, 5, 5,
5, NA, 6, 5, 5), DC3 = c(4, 7, 4, 4, NA, 4, 5, 4, 5, 4, 5, 5,
6, 4, 6, 6, 5), DC4 = c(4, 7, 5, NA, NA, 4, 6, 5, 5, 4, 3, 4,
6, 5, 5, 6, 3), DC5 = c(7, 8, 5, NA, NA, 10, 7, 6, 8, 6, 6, 7,
11, 10, 5, 7, 6), DC6 = c(8, 8, NA, NA, NA, 11, 9, 8, 9, 9, 10,
10, 12, 16, 6, 8, 9), DC7 = c(10, 10, 10, NA, NA, 8, 9, 8, 13,
8, 11, 9, 14, 13, 8, 8, 11), DC8 = c(17, 10, 10, NA, NA, 10,
10, 10, 15, 10, 14, 11, 23, 15, 14, 13, 14), DC9 = c(16, 9, 9,
NA, NA, 12, 13, 11, 13, 15, 15, 13, 17, 15, 25, 17, 12)), .Names = c("DC1",
"DC2", "DC3", "DC4", "DC5", "DC6", "DC7", "DC8", "DC9"), class = "data.frame", row.names = c(NA,
-17L))
How can I filter the data frame, keeping only rows with complete data in columns DC3 through DC9?
Here's a dplyr option:
library(dplyr)
x %>%
  filter_at(vars(DC3:DC9), all_vars(!is.na(.)))
or:
x %>%
  filter_at(vars(DC3:DC9), all_vars(complete.cases(.)))
and here's a tidyr option:
x %>% tidyr::drop_na(DC3:DC9)
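Note that filter_at() is superseded in recent dplyr releases (1.0.0 and later); the current equivalent uses if_all():
x %>%
  filter(if_all(DC3:DC9, ~ !is.na(.x)))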
We can subset the data and apply complete.cases:
x[complete.cases(x[3:9]),]
or using column names
x[complete.cases(x[paste0("DC", 3:9)]),]
You could use str_extract from the stringr package to extract the number from each column name, and then keep the rows that are complete across those columns:
# get the number from each column name
col_num <- as.numeric(stringr::str_extract(names(x), "\\d+"))
# keep rows with complete data in columns DC3 to DC9
x[complete.cases(x[(col_num >= 3) & (col_num < 10)]), ]
Note: to install stringr, run install.packages("stringr").

Training Multiple Auto.Arima Models in Parallel

In the code below I'm trying to train two different auto.arima models at the same time, in parallel on different cores. I get the error below when I run the code. I'm not sure whether my issue is with do.call or with parLapply; I'm also pretty new to parallel processing, so any tips are very welcome.
Code:
library("forecast")
library("parallel")
TList2<-list(x=tsd1, lambda = Tlambda, stepwise=TRUE, approximation = TRUE)
DList2<-list(x=tsd2, lambda = Rlambda, stepwise=TRUE, approximation = TRUE)
##Parallelizing ARIMA Model Training
# Calculate the number of cores
no_cores <- 1
# Initiate cluster
cl <- makeCluster(no_cores)
ARIMA_List<-list(TList2,DList2)
ARIMA_Models<-parLapply(cl, ARIMA_List,
function(x){do.call(auto.arima, args=x)})
stopCluster(cl)
Error:
Error in checkForRemoteErrors(val) :
one node produced an error: object 'auto.arima' not found
Data:
dput(TList2)
structure(list(x = c(6, 15.5, 22, 16, NA, NA, 13, 13.5, 10, 6,
14.5, 16, NA, 8, 11, NA, 2, 2, 10, NA, 9, NA, 11, 16, NA, 4,
17, 7, 11.5, 22, 20.5, 10, 22, NA, 13, 17, 22, 9, 13, 19, 8,
16, 18, 22, 21, 14, 7, 20, 21.5, 17), lambda = 0.999958829041611,
stepwise = TRUE, approximation = TRUE), .Names = c("x", "lambda",
"stepwise", "approximation"))
dput(DList2)
structure(list(x = c(11, 4, 8, 11, 11, NA, 3, 2.5, 6, 11, 7,
1, NA, 6, 6, NA, 6, 11, 3, NA, 11, NA, 10, 10, NA, NA, 9, 3,
3, 11, 8, 10, NA, NA, 11, 10, 9, 3, 7, NA, 2, 4, 11, 2.5, 3,
NA, 4, 7, 1, 5), lambda = 0.170065851742339, stepwise = TRUE,
approximation = TRUE), .Names = c("x", "lambda", "stepwise",
"approximation"))
I think forecast::auto.arima needs to be available on the cluster workers too, so try loading the package on each worker, for example with clusterEvalQ, like this:
TList2 <- structure(list(x = c(6, 15.5, 22, 16, NA, NA, 13, 13.5, 10, 6,
14.5, 16, NA, 8, 11, NA, 2, 2, 10, NA, 9, NA, 11, 16, NA, 4,
17, 7, 11.5, 22, 20.5, 10, 22, NA, 13, 17, 22, 9, 13, 19, 8,
16, 18, 22, 21, 14, 7, 20, 21.5, 17), lambda = 0.999958829041611,
stepwise = TRUE, approximation = TRUE), .Names = c("x", "lambda",
"stepwise", "approximation"))
DList2<- structure(list(x = c(11, 4, 8, 11, 11, NA, 3, 2.5, 6, 11, 7,
1, NA, 6, 6, NA, 6, 11, 3, NA, 11, NA, 10, 10, NA, NA, 9, 3,
3, 11, 8, 10, NA, NA, 11, 10, 9, 3, 7, NA, 2, 4, 11, 2.5, 3,
NA, 4, 7, 1, 5), lambda = 0.170065851742339, stepwise = TRUE,
approximation = TRUE), .Names = c("x", "lambda", "stepwise",
"approximation"))
library("forecast")
library("parallel")
cl <- makeCluster(no_cores)
clusterEvalQ(cl, library(forecast))
ARIMA_List<-list(TList2,DList2)
ARIMA_Models<-parLapply(cl, ARIMA_List,
function(x){do.call(auto.arima, args=x)})
stopCluster(cl)
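A side note on the design: clusterEvalQ handles package loading, and the data lists here reach the workers as parLapply arguments. If the worker function instead referred to master-session objects by name, you would ship them over with clusterExport, for example:
# Only needed if the worker function references these objects by name.
clusterExport(cl, varlist = c("TList2", "DList2"))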
