I am trying to fit many nonlinear fits using wrapnls in parallel using dplyr and broom (and eventually mclapply), but I am getting a parsing evaluation error from nlxb:
Error in parse(text = joe) (from #11) : <text>:1:6: unexpected input
1: b1.10% <- 20
I get this error using both do and lapply approaches.
library(nlmrt)
library(dplyr)
library(purrr)
library(broom)
data_frame(x = seq(0, 200, 0.1),
           y = 1.2*exp(-(x - 10)^2/(2*4.2^2)) + 2.4*exp(-(x - 50)^2/(2*3.8^2)) +
               5.3*exp(-(x - 80)^2/(2*5.1^2)) + rnorm(length(x), sd = 0.05)) %>%
  do({
    xl <- quantile(.$x, 0.1, na.rm = TRUE)
    xm <- quantile(.$x, 0.5, na.rm = TRUE)
    xh <- quantile(.$x, 0.8, na.rm = TRUE)
    starts <- c(a1 = 5, a2 = 5, a3 = 5,
                b1 = xl, b2 = xm, b3 = xh,
                c1 = 5, c2 = 5, c3 = 5)
    fmla <- y ~ a1*exp(-(x - b1)^2/(2*c1^2)) + a2*exp(-(x - b2)^2/(2*c2^2)) + a3*exp(-(x - b3)^2/(2*c3^2))
    df <- data_frame(x = .$x, y = .$y)
    mod <- wrapnls(fmla, lower = 0, upper = 200, start = starts, data = df)
    tidy(mod)
  })
Is there any way around this?
The problem isn't with the do() part; it's the code inside it, so you can debug that part directly. The starts vector is getting the b# names concatenated with the quantile labels:
names(starts)
## [1] "a1" "a2" "a3" "b1.10%" "b2.50%" "b3.80%" "c1" "c2" "c3"
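You can reproduce the name mangling in isolation (a minimal sketch with made-up values): quantile() returns a named vector ("10%", "50%", ...), and c() pastes that name onto the element name you supply, which is what nlxb later fails to parse as an assignment like b1.10% <- 20.
q <- quantile(seq(0, 200, 0.1), 0.1)
names(c(b1 = q))            # "b1.10%"
names(c(b1 = unname(q)))    # "b1"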
Adding unname to the quantile calculation fixes the issue.
data_frame(x = seq(0, 200, 0.1),
           y = 1.2*exp(-(x - 10)^2/(2*4.2^2)) + 2.4*exp(-(x - 50)^2/(2*3.8^2)) +
               5.3*exp(-(x - 80)^2/(2*5.1^2)) + rnorm(length(x), sd = 0.05)) %>%
  do({
    xl <- quantile(.$x, 0.1, na.rm = TRUE) %>% unname()
    xm <- quantile(.$x, 0.5, na.rm = TRUE) %>% unname()
    xh <- quantile(.$x, 0.8, na.rm = TRUE) %>% unname()
    starts <- c(a1 = 5, a2 = 5, a3 = 5,
                b1 = xl, b2 = xm, b3 = xh,
                c1 = 5, c2 = 5, c3 = 5)
    fmla <- y ~ a1*exp(-(x - b1)^2/(2*c1^2)) + a2*exp(-(x - b2)^2/(2*c2^2)) + a3*exp(-(x - b3)^2/(2*c3^2))
    df <- data_frame(x = .$x, y = .$y)
    mod <- wrapnls(fmla, lower = 0, upper = 200, start = starts, data = df)
    tidy(mod)
  })
## term estimate std.error statistic p.value
## 1 a1 2.386492 0.007455097 320.1155 0
## 2 a2 5.296250 0.006437509 822.7174 0
## 3 a3 1.199384 0.007132559 168.1562 0
## 4 b1 49.997697 0.013702894 3648.6960 0
## 5 b2 80.004023 0.007150546 11188.5193 0
## 6 b3 10.077847 0.028644821 351.8209 0
## 7 c1 3.798829 0.013702940 277.2273 0
## 8 c2 5.094727 0.007150573 712.4921 0
## 9 c3 4.175235 0.028944448 144.2499 0
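From there, the many-fits-in-parallel use mentioned in the question is mostly a matter of wrapping the working body in a function. A sketch, assuming a hypothetical named list `traces` of data frames, each with x and y columns:
library(parallel)

# Fit one trace and return a tidy coefficient table (same body as the do() block above).
fit_one <- function(df) {
  xl <- unname(quantile(df$x, 0.1, na.rm = TRUE))
  xm <- unname(quantile(df$x, 0.5, na.rm = TRUE))
  xh <- unname(quantile(df$x, 0.8, na.rm = TRUE))
  starts <- c(a1 = 5, a2 = 5, a3 = 5,
              b1 = xl, b2 = xm, b3 = xh,
              c1 = 5, c2 = 5, c3 = 5)
  fmla <- y ~ a1*exp(-(x - b1)^2/(2*c1^2)) + a2*exp(-(x - b2)^2/(2*c2^2)) +
    a3*exp(-(x - b3)^2/(2*c3^2))
  tidy(wrapnls(fmla, lower = 0, upper = 200, start = starts, data = df))
}

# `traces` is hypothetical; use mc.cores = 1 on Windows.
results <- mclapply(traces, fit_one, mc.cores = 2)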
If I have a GLM, is there any way I can efficiently find the maximum predicted output by changing one covariate while holding the others constant?
Using my simulated data:
# FUNCTIONS ====================================================================
library(tidyverse)  # needed for the pipe and ggplot() used below
logit <- function(p){
x = log(p/(1-p))
x
}
sigmoid <- function(x){
p = 1/(1 + exp(-x))
p
}
beta_duration <- function(D, select){
logit(
switch(select,
0.05 + 0.9 / (1 + exp(-2*D + 25)),
0.9 * exp(-exp(-0.5 * (D - 11))),
0.9 * exp(-exp(-(D - 11))),
0.9 * exp(-2 * exp(-(D - 9))),
sigmoid(0.847 + 0.210 * (D - 10)),
0.7 + 0.0015 * (D - 10) ^ 2,
0.7 - 0.0015 * (D - 10) ^ 2 + 0.03 * (D - 10)
)
)
}
beta_sex <- function(sex, OR = 1){
ifelse(sex == "Female", -0.5 * log(OR), 0.5 * log(OR))
}
plot_beta_duration <- function(select){
x <- seq(10, 20, by = 0.01)
y <- beta_duration(x, select)
data.frame(x = x,
y = y) %>%
ggplot(aes(x = x, y = y)) +
geom_line() +
ylim(0, 1)
}
# DATA SIMULATION ==============================================================
duration <- c(10, 12, 14, 18, 20)
sex <- factor(c("Female", "Male"))
eta <- function(duration, sex, duration_select, sex_OR, noise_sd){
beta_sex(sex, sex_OR) + beta_duration(duration, duration_select) + rnorm(length(duration), 0, noise_sd)
}
sim_data <- function(durations_type, sex_OR, noise_sd, p_female, n, seed){
set.seed(seed)
data.frame(
duration = sample(duration, n, TRUE),
sex = sample(sex, n, TRUE, c(p_female, 1 - p_female))
) %>%
rowwise() %>%
mutate(eta = eta(duration, sex, durations_type, sex_OR, noise_sd),
p = sigmoid(eta),
cured = sample(0:1, 1, prob = c(1 - p, p)))
}
# DATA SIM PARAMETERS
durations_type <- 4 # See beta_duration for functions
sex_OR <- 3 # Odds of cure for male vs female (ref)
noise_sd <- 1
p_female <- 0.7 # proportion of females in the sample
n <- 500
data <- sim_data(durations_type = 1, # See beta_duration for functions
sex_OR = 3, # Odds of cure for male vs female (ref)
noise_sd = 1,
p_female = 0.7, # proportion of females in the sample
n = 500,
seed = 21874564)
I am fitting a fractional polynomial GLM:
library(mfp)
model1 <- mfp(cured ~ fp(duration) + sex,
family = binomial(link = "logit"),
data = data)
summary(model1)
Given that I am holding sex constant, is there any way to find the value of duration within a certain range that gives me the highest predicted value? Something more efficient than:
range <- seq(10, 20, by = 1e-4)
range[which.max(predict(model1, type = "response", newdata = data.frame(duration = range, sex = "Male")))]
You can use optimize here. Just create a function which returns a prediction based on the value of duration:
f <- function(x) predict(model1, list(sex = 'Male', duration = x))
And we can find the value of duration which produces the maximum log odds within the range 0-20 by doing:
optimise(f, c(0, 20), maximum = TRUE)$maximum
#> [1] 17.95679
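Because predict() is called here without type = "response", f() returns the linear predictor, so optimise() maximises the log odds; the inverse logit is monotone, so the same duration also maximises the predicted probability. A quick check using the sigmoid() helper defined above:
f_prob <- function(x) sigmoid(f(x))   # probability scale; same maximiser as f
optimise(f_prob, c(0, 20), maximum = TRUE)$maximum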
I am using the causalTree package. When I launch the causalTree function, something is printed in the console.
library(causalTree)
n <- 1000
p <- 2
y <- rnorm(n)
X <- matrix(rnorm(n * p), ncol = p)
D <- rbinom(n, 1, 0.5)
tree <- causalTree(y ~ ., data = data.frame(y, X), treatment = D, split.Rule = "CT", cv.option = "CT", split.Honest = T, cv.Honest = T, split.Bucket = F, xval = 5, cp = 0, minsize = 20, propensity = 0.5)
# [1] 2
# [1] "CT"
Is there a way to avoid this?
causalTree probably contains print() calls internally. You can suppress the output by wrapping the call in invisible(capture.output(...)), like this:
library(causalTree)
n <- 1000
p <- 2
y <- rnorm(n)
X <- matrix(rnorm(n * p), ncol = p)
D <- rbinom(n, 1, 0.5)
invisible(capture.output(tree <- causalTree(y ~ ., data = data.frame(y, X),
treatment = D, split.Rule = "CT",
cv.option = "CT", split.Honest = T,
cv.Honest = T, split.Bucket = F, xval = 5,
cp = 0, minsize = 20, propensity = 0.5)))
Created on 2022-07-19 by the reprex package (v2.0.1)
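An alternative sketch (assuming R >= 3.6 for nullfile()): instead of capturing the printed output into memory and discarding it, redirect it to the null device via capture.output()'s file argument:
capture.output(
  tree <- causalTree(y ~ ., data = data.frame(y, X), treatment = D,
                     split.Rule = "CT", cv.option = "CT", split.Honest = TRUE,
                     cv.Honest = TRUE, split.Bucket = FALSE, xval = 5,
                     cp = 0, minsize = 20, propensity = 0.5),
  file = nullfile()
)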
I am trying to run the following code. My computer keeps freezing when I try to run it, so although I can see the correlation matrices, I am unable to view the results of the GLM/data arrays.
# running the assay
#which_p_value = "x1"
which_p_value = "groupcategory"
#which_p_value = "x1:groupcategory"
run_anova = FALSE
simulate_mixed_effect = TRUE
mixed_effect_sd = 3.094069
mixed_effect_sd_slope = 3.098661
library(tidyverse)
library(lme4)   # for glmer() used below
n_people <- c(2,5,10,15,20)
coef1 <- 1.61
coef2 <- -0.01
#coef3 <- 5
#coef4 <- 0
g1 = 0
g2 = 1
g3 = 2
distances <- c(60,90,135,202.5,303.75,455.625)/100
n_trials <- 35
oneto1000 <- 25
n_track_lengths <- length(distances)
groupcategory = c(rep(g1, n_track_lengths), rep(g2, n_track_lengths),rep(g3,n_track_lengths))
z = c(n_people)
emptydataframeforpowerplots = NULL
coef3s <- c(-5, -4, -3, -2,-1, 0, 1, 2, 3, 4, 5)
coef4s <- c(-1, -0.8, -0.6, -0.4, -0.2, 0, 0.2, 0.4, 0.6, 0.8, 1)
Datarray <- array(dim=c(length(coef3s), length(coef4s),length(n_people)))
coef3_counter =1
for (coef3 in coef3s) {
  coef4_counter = 1
  for (coef4 in coef4s) {
    z1_g2 <- coef1 + coef2*distances + coef3*g2 + coef4*g2*distances
    z1_g3 <- coef1 + coef2*distances + coef3*g3 + coef4*g3*distances
    d = NULL
    pr1 = 1/(1+exp(-z1_g2))
    pr2 = 1/(1+exp(-z1_g3))
    counter = 1
    for (i in n_people) {
      for (j in 1:oneto1000) {
        df <- c()
        for (k in 1:i) {
          # random effect from drawing a random intercept with sd = x
          if (simulate_mixed_effect) {
            coef1_r = rnorm(1, mean = coef1, sd = mixed_effect_sd)
            coef2_r = rnorm(1, mean = coef2, sd = mixed_effect_sd_slope)
          } else {
            coef1_r = coef1
            coef2_r = coef2
          }
          z_g1 <- coef1_r + coef2*distances + coef3*g1 + coef4*g1*distances
          pr = 1/(1+exp(-z_g1))
          z1_g2 <- coef1_r + coef2*distances + coef3*g2 + coef4*g2*distances
          pr1 = 1/(1+exp(-z1_g2))
          if (run_anova) {
            df <- rbind(df, data.frame(x1 = c(rep(distances, 3)),
                                       y = c(rbinom(n_track_lengths, n_trials, pr), rbinom(n_track_lengths, n_trials, pr1), rbinom(n_track_lengths, n_trials, pr2)),
                                       groupcategory = groupcategory, id = c(rep(k, 18))))
          } else { # this is for glmer data organisation
            for (m in 1:n_trials) {
              df <- rbind(df, data.frame(x1 = c(rep(distances, 3)),
                                         y = c(rbinom(n_track_lengths, 1, pr), rbinom(n_track_lengths, 1, pr1), rbinom(n_track_lengths, 1, pr2)),
                                         groupcategory = groupcategory, id = c(rep(k, 18))))
            }
          }
        }
        if (run_anova) {
          #df_aov <- aov(y~x1*groupcategory+Error(id/(x1*groupcategory)),data=df)
          #df_aov_sum <- summary(df_aov)
          #pvalue <- df_aov_sum[[5]][[1]][which_p_value,"Pr(>F)"]
          df_aov <- aov(y ~ x1*groupcategory + Error(id), data = df)
          df_aov_sum <- summary(df_aov)
          pvalue <- df_aov_sum[[2]][[1]][which_p_value, "Pr(>F)"]
        }
        checkme <- df %>% group_by(groupcategory, id) %>% summarise(miny = min(y), maxy = max(y)) %>% mutate(expectfail = miny == maxy)
        if (!run_anova) {
          mod_group_glmer <- glmer(y ~ x1 + groupcategory + (1 + x1 | id), data = df, family = "binomial")
          sum <- summary(mod_group_glmer)
          pvalue <- sum$coefficients[which_p_value, "Pr(>|z|)"]
        }
        d = rbind(d, data.frame(pvalue))
      }
      count <- plyr::ldply(d, function(c) sum(c <= 0.05))
      Datarray[coef3_counter, coef4_counter, counter] <- count$V1/oneto1000
      counter = counter + 1
      d = NULL
    }
    coef4_counter = coef4_counter + 1
  }
  coef3_counter = coef3_counter + 1
}
Does anybody have any advice on how I can overcome this issue? I have tried different things, such as lowering the range of sample sizes (n_people), but I have still been unsuccessful. My computer starts making a whirring noise and eventually I am forced to 'force quit' the program.
I have a data set like the following
library(data.table)
library(caret)

set.seed(503)
foo <- data.table(group = rep(LETTERS[1:6], 150),
y = rnorm(n = 6 * 150, mean = 5, sd = 2),
x1 = rnorm(n = 6 * 150, mean = 5, sd = 10),
x2 = rnorm(n = 6 * 150, mean = 25, sd = 10),
x3 = rnorm(n = 6 * 150, mean = 50, sd = 10),
x4 = rnorm(n = 6 * 150, mean = 0.5, sd = 10),
x5 = sample(c(1, 0), size = 6 * 150, replace = T))
foo[, period := 1:.N, by = group]
Problem: I want to forecast y one step ahead, for each group, using variables x1, ..., x5
I want to run a few models in caret to decide which I will use.
As of now, I am running it in a loop using timeslice
window.length <- 115
timecontrol <- trainControl(method = 'timeslice',
initialWindow = window.length,
horizon = 1,
selectionFunction = "best",
fixedWindow = TRUE,
savePredictions = 'final')
model_list <- list()
for(g in unique(foo$group)){
for(model in c("xgbTree", "earth", "cubist")){
dat <- foo[group == g][, c('group', 'period') := NULL]
model_list[[g]][[model]] <- train(y ~ . - 1,
data = dat,
method = model,
trControl = timecontrol)
}
}
However, I would like to run all groups at the same time, using dummy variables to identify each one, like
dat <- cbind(foo, model.matrix(~ group- 1, foo))
y x1 x2 x3 x4 x5 period groupA groupB groupC groupD groupE groupF
1: 5.710250 11.9615460 22.62916 31.04790 -4.821331e-04 1 1 1 0 0 0 0 0
2: 3.442213 8.6558983 32.41881 45.70801 3.255423e-01 1 1 0 1 0 0 0 0
3: 3.485286 7.7295448 21.99022 56.42133 8.668391e+00 1 1 0 0 1 0 0 0
4: 9.659601 0.9166456 30.34609 55.72661 -7.666063e+00 1 1 0 0 0 1 0 0
5: 5.567950 3.0306864 22.07813 52.21099 5.377153e-01 1 1 0 0 0 0 1 0
But still running the time series with the correct time ordering using timeslice.
Is there a way to declare the time variable in trainControl, so that my one-step-ahead forecast uses, in this case, six more observations for each round and drops the first 6 observations?
I can do it by ordering the data and messing with the horizon argument (given n groups, order by the time variable and set horizon = n), but this has to change if the number of groups changes. And initialWindow would have to be time * n_groups.
timecontrol <- trainControl(method = 'timeslice',
initialWindow = window.length * length(unique(foo$group)),
horizon = length(unique(foo$group)),
selectionFunction = "best",
fixedWindow = TRUE,
savePredictions = 'final')
Is there any other way?
I think the answer you are looking for is actually quite simple. You can use the skip argument to trainControl() to skip the desired number of observations after each train/test set. In this way, you only predict each group-period once, the same period is never split between the training group and testing group, and there is no information leakage.
Using the example you provided, if you set skip = 6 and horizon = 6 (the number of groups) and initialWindow = 115 * 6 (115 periods, each containing all 6 groups), then the first test set will include all groups for period 116, the next test set all groups for period 117, and so on.
library(caret)
library(tidyverse)
set.seed(503)
foo <- tibble(group = rep(LETTERS[1:6], 150),
y = rnorm(n = 6 * 150, mean = 5, sd = 2),
x1 = rnorm(n = 6 * 150, mean = 5, sd = 10),
x2 = rnorm(n = 6 * 150, mean = 25, sd = 10),
x3 = rnorm(n = 6 * 150, mean = 50, sd = 10),
x4 = rnorm(n = 6 * 150, mean = 0.5, sd = 10),
x5 = sample(c(1, 0), size = 6 * 150, replace = T)) %>%
group_by(group) %>%
mutate(period = row_number()) %>%
ungroup()
dat <- cbind(foo, model.matrix(~ group- 1, foo)) %>%
select(-group)
window.length <- 115
timecontrol <- trainControl(
method = 'timeslice',
initialWindow = window.length * length(unique(foo$group)),
horizon = length(unique(foo$group)),
skip = length(unique(foo$group)),
selectionFunction = "best",
fixedWindow = TRUE,
savePredictions = 'final'
)
model_names <- c("xgbTree", "earth", "cubist")
fits <- map(model_names,
~ train(
y ~ . - 1,
data = dat,
method = .x,
trControl = timecontrol
)) %>%
set_names(model_names)
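As a quick sanity check (a sketch, not part of the original answer): because savePredictions = 'final' is set, each train object keeps its held-out predictions in $pred, so you can confirm that every resample contains one period with all six groups:
head(fits$cubist$pred)
table(fits$cubist$pred$Resample)   # each resample should hold 6 rows, one per group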
I would use tidyr::nest() to nest groups and then iterate over the data with purrr::map(). This approach is much more flexible because it can accommodate different group sizes, different numbers of groups, and variable models or other arguments passed to caret::train(). Also, you can easily run everything in parallel using furrr.
Load packages and create data
I use tibble instead of data.table. I also reduce the size of the data.
library(caret)
library(tidyverse)
set.seed(503)
foo <- tibble(
group = rep(LETTERS[1:6], 10),
y = rnorm(n = 6 * 10, mean = 5, sd = 2),
x1 = rnorm(n = 6 * 10, mean = 5, sd = 10),
x2 = rnorm(n = 6 * 10, mean = 25, sd = 10),
x3 = rnorm(n = 6 * 10, mean = 50, sd = 10),
x4 = rnorm(n = 6 * 10, mean = 0.5, sd = 10),
x5 = sample(c(1, 0), size = 6 * 10, replace = T)
) %>%
group_by(group) %>%
mutate(period = row_number()) %>%
ungroup()
Reduce initialWindow size
window.length <- 9
timecontrol <- trainControl(
method = 'timeslice',
initialWindow = window.length,
horizon = 1,
selectionFunction = "best",
fixedWindow = TRUE,
savePredictions = 'final'
)
Create a function that will return a list of fit model objects
# To fit each model in model_list to data and return model fits as a list.
fit_models <- function(data, model_list, timecontrol) {
map(model_list,
~ train(
y ~ . - 1,
data = data,
method = .x,
trControl = timecontrol
)) %>%
set_names(model_list)
}
Fit models
model_list <- c("xgbTree", "earth", "cubist")
mods <- foo %>%
nest(-group)
mods <- mods %>%
mutate(fits = map(
data,
~ fit_models(
data = .x,
model_list = model_list,
timecontrol = timecontrol
)
))
If you want to view the results for a particular group / model you can do:
mods[which(mods$group == "A"), ]$fits[[1]]$xgbTree
Use furrr for parallel processing
Just initialize workers with plan(multiprocess) and change map to future_map. Note you might want to change the number of workers to something less than 6 if your computer has fewer than 6 processing cores.
library(furrr)
plan(multiprocess, workers = 6)
mods <- foo %>%
nest(-group)
mods <- mods %>%
mutate(fits = future_map(
data,
~ fit_models(
data = .x,
model_list = model_list,
timecontrol = timecontrol
)
))
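To compare models afterwards, you can pull a summary out of the nested fits; a sketch (assuming tidyr >= 1.0 for unnest_wider()) that extracts each model's best cross-validated RMSE per group:
results <- mods %>%
  mutate(rmse = map(fits, function(fit_list)
    map_dbl(fit_list, function(fit) min(fit$results$RMSE, na.rm = TRUE)))) %>%
  select(group, rmse) %>%
  unnest_wider(rmse)

results   # one row per group, one column of CV RMSE per model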
The data can be found here
library(nlme)
library(dfoptim)
dat0 <- read.table("aids.dat2",head=T)
dat1 <- dat0[dat0$day<=90, ] # use only first 90-day data
dat2 <- dat1[!apply(is.na(dat1),1,any),] # remove missing data
aids.dat <- groupedData(lgcopy ~ day | patid, data=dat2)
aids.dat$log10copy = log10(aids.dat$lgcopy)
myfun2 <- function(arg){
s.p1 <- arg[1]
s.b1 <- arg[2]
s.p2 <- arg[3]
s.b2 <- arg[4]
model = nlme(log10copy ~ exp(p1-b1*day) + exp(p2-b2*day),
fixed = list(p1 ~ 1, b1 ~ 1, p2 ~ 1, b2 ~ 1),
random = list(patid = pdDiag(list(p1 ~ 1, b1 ~ 1, p2 ~ 1, b2 ~ 1))),
start = list(fixed = c(p1 = s.p1, b1 = s.b1, p2 = s.p2, b2 = s.b2)),
data =aids.dat)
return(model$logLik)
}
nmkb(fn = myfun2, par = c(10,0.5,6,0.005), lower = c(5, 0.1, 5, 0.001), upper = c(15, 1, 10, 0.1))
Running the above code, I run into several errors:
Error in nlme.formula(log10copy ~ exp(p1 - b1 * day) + exp(p2 - b2 * day), :
step halving factor reduced below minimum in PNLS step
In addition: Warning message:
In nlme.formula(log10copy ~ exp(p1 - b1 * day) + exp(p2 - b2 * day), :
Singular precision matrix in level -1, block 1
The model fits fine with the starting values from par = c(10, 0.5, 6, 0.005). However, I think that as the algorithm starts trying other starting values within lower = c(5, 0.1, 5, 0.001) and upper = c(15, 1, 10, 0.1), the nlme call runs into the above problems because it is so sensitive to starting values. As a result, the nmkb call never amounts to anything.
Is there a way to circumvent this?
The model log-likelihood needs to be maximized, but most optimization routines in R (including nmkb) minimize by default, so the function to be optimized has to return the negative log-likelihood. It should look like this:
myfun2 <- function(arg){
s.p1 <- arg[1]
s.b1 <- arg[2]
s.p2 <- arg[3]
s.b2 <- arg[4]
model = nlme(log10copy ~ exp(p1-b1*day) + exp(p2-b2*day),
fixed = list(p1 ~ 1, b1 ~ 1, p2 ~ 1, b2 ~ 1),
random = list(patid = pdDiag(list(p1 ~ 1, b1 ~ 1, p2 ~ 1, b2 ~ 1))),
start = list(fixed = c(p1 = s.p1, b1 = s.b1, p2 = s.p2, b2 = s.b2)),
data =aids.dat)
return(-model$logLik)
}
Although there are still many warnings, there is no longer an error on my machine, and the algorithm converges successfully:
$par
[1] 13.460199068 0.848526199 7.764024099 0.001513636
$value
[1] -358.6631
$feval
[1] 197
$restarts
[1] 0
$convergence
[1] 0
$message
[1] "Successful convergence"
Warning messages:
1: In nlme.formula(log10copy ~ exp(p1 - b1 * day) + exp(p2 - b2 * day), :
Singular precision matrix in level -1, block 1
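If nmkb later samples a start vector for which nlme still throws an error, a common workaround (a sketch, not part of the run above) is to catch the error and return a large penalty value so the search simply moves on:
myfun2_safe <- function(arg) {
  tryCatch(myfun2(arg), error = function(e) 1e10)  # penalize failed nlme fits
}
nmkb(fn = myfun2_safe, par = c(10, 0.5, 6, 0.005),
     lower = c(5, 0.1, 5, 0.001), upper = c(15, 1, 10, 0.1))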