Nonlinear regression with random effects and lsoda - R

I am facing a problem I do not manage to solve. I would like to use nlme or nlmeODE to perform a nonlinear regression with random effects, using as a model the solution of a second order differential equation with fixed coefficients (a damped oscillator).
I manage to use nlme with simple models, but the use of deSolve to generate the solution of the differential equation seems to cause a problem. Below is an example, and the problems I face.
The data and functions
Here is the function to generate the solution of the differential equation using deSolve:
library(deSolve)
ODE2_nls <- function(t, y, parms) {
  S1  <- y[1]
  dS1 <- y[2]
  dS2 <- dS1
  dS1 <- - parms["esp2omega"]*dS1 - parms["omega2"]*S1 + parms["omega2"]*parms["yeq"]
  res <- c(dS2, dS1)
  list(res)
}
solution_analy_ODE2 <- function(omega2, esp2omega, time, y0, v0, yeq){
  parms <- c(esp2omega = esp2omega,
             omega2 = omega2,
             yeq = yeq)
  xstart <- c(S1 = y0, dS1 = v0)
  out <- lsoda(xstart, time, ODE2_nls, parms)
  return(out[,2])
}
I can generate a solution for a given period and damping factor, for example a period of 20 and a slight damping factor of 0.2:
# small example:
time <- 1:100
period <- 20 # period of oscillation
amort_factor <- 0.2
omega <- 2*pi/period # angular frequency
oscil <- solution_analy_ODE2(omega^2,amort_factor*2*omega,time,1,0,0)
plot(time,oscil)
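As a quick sanity check (a sketch, assuming the standard closed-form underdamped solution), the lsoda output can be compared against the analytical expression; note that lsoda applies the initial condition at the first element of time:
z  <- amort_factor
w  <- omega
wd <- w * sqrt(1 - z^2)        # damped angular frequency
tt <- time - time[1]           # lsoda starts integrating at the first time point
closed_form <- exp(-z*w*tt) * (cos(wd*tt) + (z*w/wd) * sin(wd*tt))
max(abs(oscil - closed_form))  # should be small, limited by the solver tolerances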
Now I generate a panel of 10 individuals with random starting phases (i.e. different starting positions and velocities). The goal is to perform a nonlinear regression with random effects on the starting values.
library(data.table)
# generate panel
Npoint <- 100 # number of time points
Nindiv <- 10 # number of individuals
period <- 20 # period of oscillation
amort_factor <- 0.2
omega <- 2*pi/period # angular frequency
# random phase
phase <- sample(seq(0,2*pi,0.01),Nindiv)
# simu data:
data_simu <- data.table(time = rep(1:Npoint,Nindiv), ID = rep(1:Nindiv,each = Npoint))
# signal generation
data_simu[, signal := solution_analy_ODE2(omega2 = omega^2,
                                          esp2omega = 2*0.2*omega,
                                          time = time,
                                          y0 = sin(phase[.GRP]),
                                          v0 = omega*cos(phase[.GRP]),
                                          yeq = 0) +
                      rnorm(.N, 0, 0.02), by = ID]
If we have a look, we have a proper dataset:
library(ggplot2)
ggplot(data_simu, aes(time, signal, color = ID)) +
  geom_line() +
  facet_wrap(~ID)
The problems
Using nlme
Using nlme with the same syntax that works on simpler examples (nonlinear functions not using deSolve), I tried:
fit <- nlme(model = signal ~ solution_analy_ODE2(esp2omega, omega2, time, y0, v0, yeq),
            data = data_simu,
            fixed = esp2omega + omega2 + y0 + v0 + yeq ~ 1,
            random = y0 ~ 1,
            groups = ~ ID,
            start = c(esp2omega = 0.08,
                      omega2 = 0.04,
                      yeq = 0,
                      y0 = 1,
                      v0 = 0))
I obtain:
Error in checkFunc(Func2, times, y, rho) : The number of derivatives returned by func() (2) must equal the length of the initial conditions vector (2000)
The traceback:
12. stop(paste("The number of derivatives returned by func() (", length(tmp[[1]]), ") must equal the length of the initial conditions vector (", length(y), ")", sep = ""))
11. checkFunc(Func2, times, y, rho)
10. lsoda(xstart, time, ODE2_nls, parms)
9. solution_analy_ODE2(omega2, esp2omega, time, y0, v0, yeq)
.
.
It looks like nlme is trying to pass a vector of starting conditions to solution_analy_ODE2, which causes an error in checkFunc from lsoda.
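The same error can be reproduced outside nlme (a sketch, assuming nlme passes starting-value vectors of length nrow(data_simu) = 1000, so that xstart ends up with 2000 elements):
try(solution_analy_ODE2(omega2 = 0.04, esp2omega = 0.08, time = 1:100,
                        y0 = rep(1, 1000), v0 = rep(0, 1000), yeq = 0))
# Error in checkFunc(Func2, times, y, rho) : The number of derivatives returned
# by func() (2) must equal the length of the initial conditions vector (2000)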
I tried using nlsList:
test <- nlsList(model = signal ~ solution_analy_ODE2(omega2, esp2omega, time, y0, v0, yeq) | ID,
                data = data_simu,
                start = list(esp2omega = 0.08, omega2 = 0.04, yeq = 0,
                             y0 = 1, v0 = 0),
                control = list(maxiter = 150, warnOnly = TRUE, minFactor = 1e-10),
                na.action = na.fail, pool = TRUE)
head(test)
Call:
Model: signal ~ solution_analy_ODE2(omega2, esp2omega, time, y0, v0, yeq) | ID
Data: data_simu
Coefficients:
esp2omega omega2 yeq y0 v0
1 0.1190764 0.09696076 0.0007577956 -0.1049423 0.30234654
2 0.1238936 0.09827158 -0.0003463023 0.9837386 0.04773775
3 0.1280399 0.09853310 -0.0004908579 0.6051663 0.25216134
4 0.1254053 0.09917855 0.0001922963 -0.5484005 -0.25972829
5 0.1249473 0.09884761 0.0017730823 0.7041049 0.22066652
6 0.1275408 0.09966155 -0.0017522320 0.8349450 0.17596648
We can see that the nonlinear fit works well on individual signals. Now if I want to perform a regression of the dataset with random effects, the syntax should be:
fit <- nlme(test,
            random = y0 ~ 1,
            groups = ~ ID,
            start = c(esp2omega = 0.08,
                      omega2 = 0.04,
                      yeq = 0,
                      y0 = 1,
                      v0 = 0))
But I obtain the exact same error message.
I then tried using nlmeODE, following Ben Bolker's comment on a similar question I asked some years ago.
Using nlmeODE
library(nlmeODE)
datas_grouped <- groupedData(signal ~ time | ID, data = data_simu,
                             labels = list(x = "time", y = "signal"),
                             units = list(x = "arbitrary", y = "arbitrary"))
modelODE <- list(DiffEq = list(dS2dt = ~ S1,
                               dS1dt = ~ -esp2omega*S1 - omega2*S2 + omega2*yeq),
                 ObsEq = list(yc = ~ S2),
                 States = c("S1", "S2"),
                 Parms = c("esp2omega", "omega2", "yeq", "ID"),
                 Init = c(y0 = 0, v0 = 0))
resnlmeode <- nlmeODE(modelODE, datas_grouped)
assign("resnlmeode", resnlmeode, envir = .GlobalEnv)
# Fitting with nlme the resulting function
model <- nlme(signal ~ resnlmeode(esp2omega, omega2, yeq, time, ID),
              data = datas_grouped,
              fixed = esp2omega + omega2 + yeq + y0 + v0 ~ 1,
              random = y0 + v0 ~ 1,
              start = c(esp2omega = 0.08,
                        omega2 = 0.04,
                        yeq = 0,
                        y0 = 0,
                        v0 = 0))
I get the error:
Error in resnlmeode(esp2omega, omega2, yeq, time, ID) : object 'yhat' not found
Here I don't understand where the error comes from, nor how to solve it.
Questions
Can you reproduce the problem?
Does anyone have an idea how to solve this problem, using either nlme or nlmeODE?
If not, is there a solution using another package? I saw nlmixr (https://cran.r-project.org/web/packages/nlmixr/index.html), but I am not familiar with it, the installation is complicated, and it was recently removed from CRAN.
Edits
#tpetzoldt suggested a nice way to debug nlme's behavior, and it surprised me a lot. Here is a working example with a nonlinear function, where I generate a set of 5 individuals with a random parameter varying between individuals:
reg_fun <- function(time, b, A, y0){
  cat("time : ", length(time), " b :", length(b), " A : ", length(A), " y0: ", length(y0), "\n")
  out <- A*exp(-b*time) + (y0 - 1)
  cat("out : ", length(out), "\n")
  tmp <- cbind(b, A, y0, time, out)
  cat(apply(tmp, 1, function(x) paste(paste(x, collapse = " "), "\n")), "\n")
  return(out)
}
time <- 0:10*10
ramdom_y0 <- sample(seq(0,1,0.01),10)
Nid <- 5
data_simu <- data.table(time = rep(time, Nid),
                        ID = rep(LETTERS[1:Nid], each = length(time))
                        )[, signal := reg_fun(time, 0.02, 2, ramdom_y0[.GRP]) + rnorm(.N, 0, 0.1), by = ID]
The cat calls in the function print:
time : 11 b : 1 A : 1 y0: 1
out : 11
0.02 2 0.64 0 1.64
0.02 2 0.64 10 1.27746150615596
0.02 2 0.64 20 0.980640092071279
0.02 2 0.64 30 0.737623272188053
0.02 2 0.64 40 0.538657928234443
0.02 2 0.64 50 0.375758882342885
0.02 2 0.64 60 0.242388423824404
0.02 2 0.64 70 0.133193927883213
0.02 2 0.64 80 0.0437930359893108
0.02 2 0.64 90 -0.0294022235568269
0.02 2 0.64 100 -0.0893294335267746
.
.
.
Now I do the same with nlme:
nlme(model = signal ~ reg_fun(time, b, A, y0),
     data = data_simu,
     fixed = b + A + y0 ~ 1,
     random = y0 ~ 1,
     groups = ~ ID,
     start = c(b = 0.03, A = 1, y0 = 0))
I get:
time : 55 b : 55 A : 55 y0: 55
out : 55
0.03 1 0 0 0
0.03 1 0 10 -0.259181779318282
0.03 1 0 20 -0.451188363905974
0.03 1 0 30 -0.593430340259401
0.03 1 0 40 -0.698805788087798
0.03 1 0 50 -0.77686983985157
0.03 1 0 60 -0.834701111778413
0.03 1 0 70 -0.877543571747018
0.03 1 0 80 -0.909282046710588
0.03 1 0 90 -0.93279448726025
0.03 1 0 100 -0.950212931632136
0.03 1 0 0 0
0.03 1 0 10 -0.259181779318282
0.03 1 0 20 -0.451188363905974
0.03 1 0 30 -0.593430340259401
0.03 1 0 40 -0.698805788087798
0.03 1 0 50 -0.77686983985157
0.03 1 0 60 -0.834701111778413
0.03 1 0 70 -0.877543571747018
0.03 1 0 80 -0.909282046710588
0.03 1 0 90 -0.93279448726025
0.03 1 0 100 -0.950212931632136
0.03 1 0 0 0
0.03 1 0 10 -0.259181779318282
0.03 1 0 20 -0.451188363905974
0.03 1 0 30 -0.593430340259401
0.03 1 0 40 -0.698805788087798
0.03 1 0 50 -0.77686983985157
0.03 1 0 60 -0.834701111778413
0.03 1 0 70 -0.877543571747018
0.03 1 0 80 -0.909282046710588
0.03 1 0 90 -0.93279448726025
0.03 1 0 100 -0.950212931632136
0.03 1 0 0 0
0.03 1 0 10 -0.259181779318282
0.03 1 0 20 -0.451188363905974
0.03 1 0 30 -0.593430340259401
0.03 1 0 40 -0.698805788087798
0.03 1 0 50 -0.77686983985157
0.03 1 0 60 -0.834701111778413
0.03 1 0 70 -0.877543571747018
0.03 1 0 80 -0.909282046710588
0.03 1 0 90 -0.93279448726025
0.03 1 0 100 -0.950212931632136
0.03 1 0 0 0
0.03 1 0 10 -0.259181779318282
0.03 1 0 20 -0.451188363905974
0.03 1 0 30 -0.593430340259401
0.03 1 0 40 -0.698805788087798
0.03 1 0 50 -0.77686983985157
0.03 1 0 60 -0.834701111778413
0.03 1 0 70 -0.877543571747018
0.03 1 0 80 -0.909282046710588
0.03 1 0 90 -0.93279448726025
0.03 1 0 100 -0.950212931632136
time : 55 b : 55 A : 55 y0: 55
out : 55
0.03 1 0 0 0
0.03 1 0 10 -0.259181779318282
0.03 1 0 20 -0.451188363905974
0.03 1 0 30 -0.593430340259401
0.03 1 0 40 -0.698805788087798
0.03 1 0 50 -0.77686983985157
0.03 1 0 60 -0.834701111778413
0.03 1 0 70 -0.877543571747018
0.03 1 0 80 -0.909282046710588
0.03 1 0 90 -0.93279448726025
0.03 1 0 100 -0.950212931632136
0.03 1 0 0 0
0.03 1 0 10 -0.259181779318282
0.03 1 0 20 -0.451188363905974
0.03 1 0 30 -0.593430340259401
0.03 1 0 40 -0.698805788087798
0.03 1 0 50 -0.77686983985157
0.03 1 0 60 -0.834701111778413
0.03 1 0 70 -0.877543571747018
0.03 1 0 80 -0.909282046710588
0.03 1 0 90 -0.93279448726025
0.03 1 0 100 -0.950212931632136
0.03 1 0 0 0
0.03 1 0 10 -0.259181779318282
0.03 1 0 20 -0.451188363905974
0.03 1 0 30 -0.593430340259401
0.03 1 0 40 -0.698805788087798
0.03 1 0 50 -0.77686983985157
0.03 1 0 60 -0.834701111778413
0.03 1 0 70 -0.877543571747018
0.03 1 0 80 -0.909282046710588
0.03 1 0 90 -0.93279448726025
0.03 1 0 100 -0.950212931632136
...
So nlme binds the time vector 5 times (the number of individuals) and passes it to the function, with the parameters repeated the same number of times. This is of course not compatible with the way lsoda and my function work.

It seems that the ode model is called with a wrong argument, so that it gets a vector with 2000 state variables instead of 2. Try the following to see the problem:
ODE2_nls <- function(t, y, parms) {
  cat(length(y), "\n") # <----
  S1  <- y[1]
  dS1 <- y[2]
  dS2 <- dS1
  dS1 <- - parms["esp2omega"]*dS1 - parms["omega2"]*S1 + parms["omega2"]*parms["yeq"]
  res <- c(dS2, dS1)
  list(res)
}
Edit: I think that the analytical function worked because it is vectorized, so you may try to vectorize the ode function, either by iterating over the ode model or (better) internally, using vectors as state variables. As ode is fast in solving systems with several 100k equations, 2000 should be feasible.
I guess that both states and parameters from nlme are passed as vectors. The state variable of the ode model is then a "long" vector, while the parameters can be implemented as a list.
Here is an example (edited, now with parameters as a list):
ODE2_nls <- function(t, y, parms) {
  #cat(length(y), "\n")
  #cat(length(parms$omega2))
  ndx <- seq(1, 2*N - 1, 2)
  S1  <- y[ndx]
  dS1 <- y[ndx + 1]
  dS2 <- dS1
  dS1 <- - parms$esp2omega * dS1 - parms$omega2 * S1 + parms$omega2 * parms$yeq
  res <- c(dS2, dS1)
  list(res)
}
solution_analy_ODE2 <- function(omega2, esp2omega, time, y0, v0, yeq){
  parms  <- list(esp2omega = esp2omega, omega2 = omega2, yeq = yeq)
  xstart <- c(S1 = y0, dS1 = v0)
  out <- ode(xstart, time, ODE2_nls, parms, atol = 1e-4, rtol = 1e-4, method = "ode45")
  return(out[,2])
}
Then set (or calculate) the number of equations, e.g. N <- 1 or N <- 1000 respectively, before the calls.
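A short usage sketch of that (my assumption: N is the number of (y0, v0) pairs packed into the state vector, i.e. 1 for the single-trajectory example above and Nindiv * Npoint = 1000 when nlme stacks the data):
N <- 1                 # single trajectory, e.g. the plotting example above
oscil <- solution_analy_ODE2(omega^2, amort_factor*2*omega, time, 1, 0, 0)

N <- Nindiv * Npoint   # 10 * 100 = 1000 stacked values before the nlme() call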
The model runs through this way before running into numerical issues, but that's another story ...
You may then try to use another ode solver (e.g. vode), set atol and rtol to lower values, tweak nlme's optimization parameters, use box constraints ... and so on, as usual in nonlinear optimization.
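For example, a hedged variant of the ode() call in the wrapper above, switching to a stiff solver and tightening the tolerances:
out <- ode(xstart, time, ODE2_nls, parms, method = "vode", atol = 1e-8, rtol = 1e-8)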

I found a solution by hacking nlme's behavior: as shown in my edit, the problem comes from the fact that nlme passes a vector of length Nindividuals x Npoints to the nonlinear function, assuming that the function maps each time point to a value. But lsoda doesn't work that way, as it integrates an equation along time (i.e. it needs all times up to a given time point to produce a value).
My solution consists of decomposing the parameter vectors that nlme passes to my function, doing the calculation per individual, and re-assembling a single vector:
library(zoo)         # for na.locf
library(data.table)  # for rleid

detect_id <- function(vec){
  tmp <- c(0, diff(vec))              # a new individual starts where time decreases
  out <- rep(NA_real_, length(tmp))
  out[tmp < 0] <- seq_len(sum(tmp < 0))
  out <- na.locf(out, na.rm = FALSE)  # carry the group marker forward
  rleid(out)                          # the leading NA block becomes group 1
}
detect_id decomposes the stacked time vector into per-individual identifiers:
detect_id(rep(1:10,3))
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
And then the function doing the numerical integration loops over each individual and binds the resulting vectors together:
library(magrittr)  # for %>%

solution_analy_ODE2_modif <- function(omega2, esp2omega, time, y0, v0, yeq){
  tmp <- detect_id(time)
  # integrate each individual separately, then stack the results
  out <- lapply(unique(tmp), function(i){
    idxs  <- which(tmp == i)
    parms <- c(esp2omega = esp2omega[idxs][1],
               omega2    = omega2[idxs][1],
               yeq       = yeq[idxs][1])
    xstart <- c(S1 = y0[idxs][1], dS1 = v0[idxs][1])
    out_tmp <- lsoda(xstart, time[idxs], ODE2_nls, parms)
    out_tmp[, 2]
  }) %>% unlist()
  return(out)
}
If I make a test where I pass vectors similar to what nlme passes to the function:
omega2vec <- rep(0.1,30)
eps2omegavec <- rep(0.1,30)
timevec <- rep(1:10,3)
y0vec <- rep(1,30)
v0vec <- rep(0,30)
yeqvec = rep(0,30)
solution_analy_ODE2_modif(omega2 = omega2vec,
                          esp2omega = eps2omegavec,
                          time = timevec,
                          y0 = y0vec,
                          v0 = v0vec,
                          yeq = yeqvec)
[1] 1.0000000 0.9520263 0.8187691 0.6209244 0.3833110 0.1321355 -0.1076071 -0.3143798
[9] -0.4718058 -0.5697255 1.0000000 0.9520263 0.8187691 0.6209244 0.3833110 0.1321355
[17] -0.1076071 -0.3143798 -0.4718058 -0.5697255 1.0000000 0.9520263 0.8187691 0.6209244
[25] 0.3833110 0.1321355 -0.1076071 -0.3143798 -0.4718058 -0.5697255
It works. It would not work with #tpetzoldt's method, because the time vector is not monotone (it drops from 10 back to 1), which would cause integration problems. Here I really need to hack the way nlme works.
Now:
fit <- nlme(model = signal ~ solution_analy_ODE2_modif(esp2omega, omega2, time, y0, v0, yeq),
            data = data_simu,
            fixed = esp2omega + omega2 + y0 + v0 + yeq ~ 1,
            random = y0 ~ 1,
            groups = ~ ID,
            start = c(esp2omega = 0.5,
                      omega2 = 0.5,
                      yeq = 0,
                      y0 = 1,
                      v0 = 1))
works like a charm
summary(fit)
Nonlinear mixed-effects model fit by maximum likelihood
Model: signal ~ solution_analy_ODE2_modif(omega2, esp2omega, time, y0, v0, yeq)
Data: data_simu
AIC BIC logLik
-597.4215 -567.7366 307.7107
Random effects:
Formula: list(y0 ~ 1, v0 ~ 1)
Level: ID
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
y0 0.61713329 y0
v0 0.67815548 -0.269
Residual 0.03859165
Fixed effects: esp2omega + omega2 + y0 + v0 + yeq ~ 1
Value Std.Error DF t-value p-value
esp2omega 0.4113068 0.00866821 186 47.45002 0.0000
omega2 1.0916444 0.00923958 186 118.14876 0.0000
y0 0.3848382 0.19788896 186 1.94472 0.0533
v0 0.1892775 0.21762610 186 0.86974 0.3856
yeq 0.0000146 0.00283328 186 0.00515 0.9959
Correlation:
esp2mg omega2 y0 v0
omega2 0.224
y0 0.011 -0.008
v0 0.005 0.030 -0.269
yeq -0.091 -0.046 0.009 -0.009
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-3.2692477 -0.6122453 0.1149902 0.6460419 3.2890201
Number of Observations: 200
Number of Groups: 10

Related

Nonlinear constrained optimization with optimx

I'm trying to use optimx for a constrained nonlinear problem, but I just can't find an example online that I can adjust (I'm not an R programmer). I found that I should be using the call below to test a few algorithms:
optimx(par, fn, lower=low, upper=up, method=c("CG", "L-BFGS-B", "spg", "nlm"))
I understand par is just an example of a feasible solution. So, if I have two variables and (0,3) is feasible I can just do par <- c(0,3). If I want to minimise
2x+3y
subject to
2x^2 + 3y^2 <= 100
x<=3
-x<=0
-y<=-3
I guess I can set fn like
fn <- function(x) { return(2*x[1] + 3*x[2]) }
but how do I set lower and upper for my constraints?
Many thanks!
1) We can incorporate the constraints within the objective function by returning a large number if any constraint is violated.
For most methods (but not Nelder Mead) the requirement is that the objective function be continuous and differentiable and requires a starting value in the interior of the feasible region, not the boundary. These requirements are not satisfied for f below but we will try it anyways.
library(optimx)
f <- function(z, x = z[1], y = z[2]) {
  if (2*x^2 + 3*y^2 <= 100 && x <= 3 && -x <= 0 && -y <= -3) 2*x + 3*y else 1e10
}
optimx(c(0, 3), f, method = c("Nelder", "CG", "L-BFGS-B", "spg", "nlm"))
## p1 p2 value fevals gevals niter convcode kkt1 kkt2 xtime
## Nelder-Mead 0 3 9 187 NA NA 0 FALSE FALSE 0.00
## CG 0 3 9 41 1 NA 0 FALSE FALSE 0.00
## L-BFGS-B 0 3 9 21 21 NA 52 FALSE FALSE 0.00
## spg 0 3 9 1077 NA 1 0 FALSE FALSE 0.05
## nlm 0 3 9 NA NA 1 0 FALSE FALSE 0.00
1a) This also works with optim where Nelder-Mead is the default (or you could try constrOptim, which explicitly supports linear inequality constraints; see the sketch after the output below).
optim(c(0, 3), f)
## $par
## [1] 0 3
##
## $value
## [1] 9
##
## $counts
## function gradient
## 187 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
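For reference, a hedged sketch of the constrOptim route mentioned in 1a): the linear constraints are written as ui %*% par >= ci, the starting point must be strictly feasible, and the inactive quadratic constraint is dropped (see 2) below):
obj <- function(z) 2*z[1] + 3*z[2]
ui <- rbind(c(-1, 0),   # -x >= -3, i.e. x <= 3
            c( 1, 0),   #  x >= 0
            c( 0, 1))   #  y >= 3
ci <- c(-3, 0, 3)
constrOptim(c(1, 4), obj, grad = NULL, ui = ui, ci = ci)$par
## approaches c(0, 3); the log-barrier stops just inside the boundary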
2) Above we notice that the 2x^2 + 3y^2 <= 100 constraint is not active so we can drop it. Now since the objective function is increasing in both x and y independently it is obvious that we want to set both of them to their lower bounds so c(0, 3) is the answer.
If we want to use optimx anyways then we just use upper= and lower= arguments for those methods that use them.
f2 <- function(z, x = z[1], y = z[2]) 2*x+3*y
optimx(c(0, 3), f2, lower = c(0, 3), upper = c(3, Inf),
method = c("L-BFGS-B", "spg", "nlm"))
## p1 p2 value fevals gevals niter convcode kkt1 kkt2 xtime
## L-BFGS-B 0 3 9 1 1 NA 0 FALSE NA 0.00
## spg 0 3 9 1 NA 0 0 FALSE NA 0.01
## nlminb 0 3 9 1 2 1 0 FALSE NA 0.00
## Warning message:
## In BB::spg(par = par, fn = ufn, gr = ugr, lower = lower, upper = upper, :
## convergence tolerance satisified at intial parameter values.

R: calculating interest and balance at each step

I have a stupid question, but I can't solve it easily with lag/lead or other approaches.
Let's say I have this table. I have an initial balance of 100, Position indicates whether I bid or not, and Percentage_change is what I get if I bid. How can I calculate the Balance column to get something like this?
Position  Percentage_change  Balance
       0               0.01      100
       0              -0.01      100
       1               0.02      102
       1               0.05    107.1
       0              -0.02    107.1
       1               0.03    110.3
cumprod is the function you are looking for, e.g.
df <- data.frame(Position = c(0, 0, 1, 1, 0, 1),
                 Percentage_change = c(0.01, -0.01, 0.02, 0.05, -0.02, 0.03))
# convert into multiplier form, e.g. 100 * 1.01
df$Multiplier <- df$Percentage_change + 1
# when Position is 0, reset this to 1 so there is no change to the balance
df[df$Position == 0, ]$Multiplier <- 1
# take the starting balance of 100 and multiply by the cumulative product of the multipliers
df$Balance <- 100 * cumprod(df$Multiplier)
df
df
Position Percentage_change Multiplier Balance
1 0 0.01 1.00 100.000
2 0 -0.01 1.00 100.000
3 1 0.02 1.02 102.000
4 1 0.05 1.05 107.100
5 0 -0.02 1.00 107.100
6 1 0.03 1.03 110.313
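The same result can also be obtained in one step (a minor variant, skipping the intermediate Multiplier column):
df$Balance <- 100 * cumprod(ifelse(df$Position == 1, 1 + df$Percentage_change, 1))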

Setting covariates to the mean for marginal effects using ggpredict

I have some data (df):
inter out time int
0 1 21 0
0 0 32 0
0 1 44 0
0 0 59 0
0 1 88 0
0 1 111 0
0 0 54 0
1 0 63 63
1 1 73 73
1 1 83 83
1 0 93 93
1 1 52 52
1 0 33 33
1 1 10 10
And I run a glm model:
m <- glm(out ~ inter + time + int, data = df, family = binomial(link = "logit"))
The model coefficients are:
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.00916 1.82200 -0.554 0.580
inter 2.00906 2.64959 0.758 0.448
time 0.02293 0.03010 0.762 0.446
int -0.03502 0.04215 -0.831 0.406
I want to get the marginal effects, which according to my understanding are the predicted probabilities at certain levels holding other variables constant, which in this case is 0 vs. 1 for my binary predictor, 'inter'. If this is incorrect, please let me know. According to https://rdrr.io/cran/ggeffects/man/ggpredict.html, "All remaining covariates that are not specified in terms are held constant (see 'Details')". The means of time and int are 58.29 and 29.07, respectively, so the formula to get the predicted probabilities is:
Level 0:
sum = -1.00916 + (.02293 * 58.29) + (-.03502 * 29.07)
sume = exp(sum)
sumee <- sume/(1+sume)
sumee = 0.33
Level 1:
sum = -1.00916 + 2.00906 + (.02293 * 58.29) + (-.03502 * 29.07)
sume = exp(sum)
sumee <- sume/(1+sume)
sumee = 0.79
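The same two numbers can be checked with plogis, which computes the inverse logit directly (using the coefficients and means quoted above):
plogis(-1.00916 + 0.02293*58.29 - 0.03502*29.07)             # level 0, about 0.33
plogis(-1.00916 + 2.00906 + 0.02293*58.29 - 0.03502*29.07)   # level 1, about 0.79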
The predicted probability holding other variables constant is 0.79 for level 1 compared to 0.33 for level 0, which is exactly what these ggpredict statements produce:
ggpredict(m, terms = c("inter", "time [mean]"))
ggpredict(m, terms = c("inter"))
However, when I specify 'int' at the mean with "int [mean]", it produces different results:
ggpredict(m, terms = c("inter", "time [mean]", "int [mean]"))
ggpredict(m, terms = c("inter", "int [mean]"))
It says level 0 has a predicted probability of 0.19, compared to 0.64 for level 1. Why? Shouldn't all four commands produce the same results, since R automatically holds covariates at the mean? Using other functions for 'int', such as min or max (e.g. ggpredict(m, terms = c("inter", "time [mean]", "int [min]"))), produces predictable results based on the formula.

Run xgboost model on single test data point

I am trying to run an XGBoost model on a single test data point.
a <- data.frame(satisfaction_level = 0.14,
                last_evaluation = 0.92,
                number_project = 2,
                average_montly_hours = 350,
                time_spend_company = 5,
                Work_accident = 0,
                promotion_last_5years = 1,
                sales = factor("sales", levels = levels(Bdata$sales)),
                salary = factor("medium", levels = levels(Bdata$salary)))
#Converting it into matrix format
str(a)
a <- as.data.frame.model.matrix(a)
I get the below error when I predict using the model:
xgb.preds = predict(xgb.model, a)
Error in xgb.DMatrix(newdata, missing = missing) :
xgb.DMatrix: does not support to construct from list
Created the model using:
xgb.model <- xgboost(param =param, data = xgb.train.data,nrounds = 1500 ,eta = 0.05,subsample = 1 )
and Bdata contains:
head(Bdata)
satisfaction_level last_evaluation number_project average_montly_hours time_spend_company Work_accident left promotion_last_5years sales salary
1 0.38 0.53 2 157 3 0 1 0 sales low
2 0.80 0.86 5 262 6 0 1 0 sales medium
3 0.11 0.88 7 272 4 0 1 0 sales medium
4 0.72 0.87 5 223 5 0 1 0 sales low
5 0.37 0.52 2 159 3 0 1 0 sales low
6 0.41 0.50 2 153 3 0 1 0 sales low
You should not use as.data.frame.model.matrix. Your a object is still a data.frame. You need to use a <- as.matrix(a).
See below for a workable example using the iris dataset.
library(xgboost)
x = as.matrix(iris[, 1:4])
y = as.numeric(factor(iris[, 5]))-1
model <- xgboost(data = x, label = y, nrounds = 10)
new <- data.frame(Sepal.Length = 5.1,
                  Sepal.Width = 3.5,
                  Petal.Length = 1.4,
                  Petal.Width = 0.2)
#error because it is a data.frame
preds <- predict(model, newdata = new)
# Error in xgb.DMatrix(newdata, missing = missing) :
# xgb.DMatrix: does not support to construct from list
# This works because data.frame is turned into a matrix
preds <- predict(model, newdata = as.matrix(new))
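For the original data with factor columns (sales, salary), the single row also has to be encoded the same way as the training matrix; a hedged sketch, assuming xgb.train.data was built with an equivalent model.matrix call:
# assumption: the training matrix was created with the same design-matrix encoding
a_mat <- model.matrix(~ . - 1, data = a)  # expands the factor columns into dummy columns
xgb.preds <- predict(xgb.model, a_mat)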

Assigning a value to each range of consecutive numbers with same sign in R

I'm trying to create a data frame where a column exists that holds values representing the length of runs of positive and negative numbers, like so:
Time V Length
0.5 -2 1.5
1.0 -1 1.5
1.5 0 0.0
2.0 2 1.0
2.5 0 0.0
3.0 1 1.75
3.5 2 1.75
4.0 1 1.75
4.5 -1 0.75
5.0 -3 0.75
The Length column sums the length of time that the value has been positive or negative. Zeros are given a 0 since they are an inflection point. If there is no zero separating the sign change, the crossing time is taken as the average of the two time points on either side of the inflection (for example, the sign change between Time 4.0 and 4.5 is placed at 4.25, so the final negative run has Length 5.0 - 4.25 = 0.75).
I am trying to approximate the amount of time that these values are spending either positive or negative. I've tried this with a for loop with varying degrees of success, but I would like to avoid looping because I am working with extremely large data sets.
I've spent some time looking at sign and diff as they are used in this question about sign changes. I've also looked at this question that uses transform and aggregate to sum consecutive duplicate values. I feel like I could use this in combination with sign and/or diff, but I'm not sure how to retroactively assign these sums to the ranges that created them or how to deal with spots where I'm taking the average across the inflection.
Any suggestions would be appreciated. Here is the sample dataset:
dat <- data.frame(Time = seq(0.5, 5, 0.5), V = c(-2, -1, 0, 2, 0, 1, 2, 1, -1, -3))
First find the indices of "Time" which need to be interpolated: consecutive "V" which lack a zero between positive and negative values; they have an abs(diff(sign(V))) equal to two.
id <- which(abs(c(0, diff(sign(dat$V)))) == 2)
Add rows with average "Time" between relevant indices and corresponding "V" values of zero to the original data. Also add rows of "V" = 0 at "Time" = 0 and at last time step (according to the assumptions mentioned by #Gregor). Order by "Time".
d2 <- rbind(dat,
            data.frame(Time = (dat$Time[id] + dat$Time[id - 1])/2, V = 0),
            data.frame(Time = c(0, max(dat$Time)), V = c(0, 0)))
d2 <- d2[order(d2$Time), ]
Calculate time differences between time steps which are zero and replicate them using "zero-group indices".
d2$Length <- diff(d2$Time[d2$V == 0])[cumsum(d2$V == 0)]
Add values to original data:
merge(dat, d2)
# Time V Length
# 1 0.5 -2 1.50
# 2 1.0 -1 1.50
# 3 1.5 0 1.00
# 4 2.0 2 1.00
# 5 2.5 0 1.75
# 6 3.0 1 1.75
# 7 3.5 2 1.75
# 8 4.0 1 1.75
# 9 4.5 -1 0.75
# 10 5.0 -3 0.75
Set "Length" to 0 where V == 0.
This works, at least for your test case. And it should be pretty efficient. It makes some assumptions, I'll try to point out the big ones.
First we extract the vectors and stick 0s on the beginning. We also set the last V to 0. The calculation will be based on time differences between 0s, so we need to start and end with 0s. Your example seems to tacitly assume V = 0 at Time = 0, hence the initial 0, and it stops abruptly at the maximum time, so we set V = 0 there as well:
Time = c(0, dat$Time)
V = c(0, dat$V)
V[length(V)] = 0
To fill in the skipped 0s, we use approx to do linear approximation on sign(V). It also assumes that your sampling frequency is regular, so we can get away with doubling the frequency to get all the missing 0s.
ap = approx(Time, sign(V), xout = seq(0, max(Time), by = 0.25))
The values we want to fill in are the durations between the 0s, both observed and approximated. In the correct order, these are:
dur = diff(ap$x[ap$y == 0])
Lastly, we need the indices of the original data to fill in the durations. This is the hackiest part of this answer, but it seems to work. Maybe someone will suggest a nice simplification.
# first use rleid to get the sign groupings
group = data.table::rleid(sign(dat$V))
# then we need to set the groups corresponding to 0 values to 0
# and reduce any group numbers following 0s correspondingly
# lastly we add 1 to everything so that we can stick 0 at the
# front of our durations and assign those to the 0 V values
ind = (group - cumsum(dat$V == 0)) * (dat$V != 0) + 1
# fill it in
dat$Length = c(0, dur)[ind]
dat
# Time V Length
# 1 0.5 -2 1.50
# 2 1.0 -1 1.50
# 3 1.5 0 0.00
# 4 2.0 2 1.00
# 5 2.5 0 0.00
# 6 3.0 1 1.75
# 7 3.5 2 1.75
# 8 4.0 1 1.75
# 9 4.5 -1 0.75
# 10 5.0 -3 0.75
It took me longer than I care to admit, but here is my solution.
Because you said you wanted to use it on large datasets (thus speed matters), I use Rcpp to write a loop that does all the checking. For speed comparisons I also create another sample dataset with 500,000 data points and check the speed (I tried to compare to the other solutions but couldn't translate them to data.table; without that it would be an unfair comparison...). If supplied, I will gladly update the speed comparisons!
Part 1: My solution
My solution looks like this:
(in length_time.cpp)
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector length_time(NumericVector time, NumericVector v) {
double start = 0;
double time_i, v_i;
bool last_positive = v[0] > 0;
bool last_negative = v[0] < 0;
int length_i = time.length();
NumericVector ret_vec(length_i);
for (int i = 0; i < length_i; ++i) {
time_i = time[i];
v_i = v[i];
if (v_i == 0) { // inflection point
if (i > 0) { // if this is not the beginning, then a regime has ended!
ret_vec[i - 1] = time_i - start;
start = time_i;
}
} else if ((v_i > 0 && last_negative) || (v_i < 0 && last_positive)) {
ret_vec[i - 1] = (time_i + time[i - 1]) / 2 - start;
start = (time_i + time[i - 1]) / 2;
}
last_positive = v_i > 0;
last_negative = v_i < 0;
}
ret_vec[length_i - 1] = time[length_i - 1] - start;
// ret_vec now only has the values for the last observation
// do something like a reverse na_locf...
double tmp_val = ret_vec[length_i - 1];
for (int i = length_i - 1; i >= 0; --i) {
if (v[i] == 0) {
ret_vec[i] = 0;
} else if (ret_vec[i] == 0){
ret_vec[i] = tmp_val;
} else {
tmp_val = ret_vec[i];
}
}
return ret_vec;
}
and then in an R-file (i.e., length_time.R):
library(Rcpp)
# setwd("...") #to find the .cpp-file
sourceCpp("length_time.cpp")
dat$Length <- length_time(dat$Time, dat$V)
dat
# Time V Length
# 1 0.5 -2 1.50
# 2 1.0 -1 1.50
# 3 1.5 0 0.00
# 4 2.0 2 1.00
# 5 2.5 0 0.00
# 6 3.0 1 1.75
# 7 3.5 2 1.75
# 8 4.0 1 1.75
# 9 4.5 -1 0.75
# 10 5.0 -3 0.75
Which seems to work on the sample dataset.
Part 2: Testing for Speed
library(data.table)
library(microbenchmark)
n <- 10000
set.seed(1235278)
dt <- data.table(time = seq(from = 0.5, by = 0.5, length.out = n),
v = cumsum(round(rnorm(n, sd = 1))))
dt[, chg := v >= 0 & shift(v, 1, fill = 0) <= 0]
plot(dt$time, dt$v, type = "l")
abline(h = 0)
for (i in dt[chg == T, time]) abline(v = i, lty = 2, col = "red")
Which results in a dataset with 985 observations (crossings).
Testing the speed with microbenchmark results in
microbenchmark(dt[, length := length_time(time, v)])
# Unit: milliseconds
# expr min lq mean median uq max neval
# dt[, `:=`(length, length_time(time, v))] 2.625714 2.7184 3.054021 2.817353 3.077489 5.235689 100
Resulting in about 3 milliseconds for calculating with 500,000 observations.
Does that help you?
Here is my attempt done completely in base R.
Joseph <- function(df) {
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) abs(x - round(x)) < tol
v <- df$V
t <- df$Time
sv <- sign(v)
nR <- length(v)
v0 <- which(v==0)
id <- which(abs(c(0, diff(sv))) > 1) ## This line and (t[id] + t[id - 1L])/2 From #Henrik
myZeros <- sort(c(v0*t[1L], (t[id] + t[id - 1L])/2))
lenVals <- diff(c(0,myZeros,t[nR])) ## Actual values that
## will populate the Length column
## remove values that result from repeating zeros from the df$V column
lenVals <- lenVals[lenVals != t[1L] | c(!is.wholenumber(myZeros/t[1L]),F)]
## Below we need to determine how long to replicate
## each of the lenVals above, so we need to find
## the starting place and length of each run...
## rle is a great candidate for both of these
m <- rle(sv)
ml <- m$lengths
cm <- cumsum(ml)
zm <- m$values != 0 ## non-zero values i.e. we won't populate anything here
rl <- m$lengths[zm] ## non-zero run-lengths
st <- cm[zm] - rl + 1L ## starting index
out <- vector(mode='numeric', length = nR)
for (i in 1:length(st)) {out[st[i]:(st[i]+rl[i]-1L)] <- lenVals[i]}
df$Length <- out
df
}
Here is the output of the given example:
Joseph(dat)
Time V Length
1 0.5 -2 1.50
2 1.0 -1 1.50
3 1.5 0 0.00
4 2.0 2 1.00
5 2.5 0 0.00
6 3.0 1 1.75
7 3.5 2 1.75
8 4.0 1 1.75
9 4.5 -1 0.75
10 5.0 -3 0.75
Here is a larger example:
set.seed(142)
datBig <- data.frame(Time=seq(0.5,50000,0.5), V=sample(-3:3, 10^5, replace=TRUE))
library(compiler)
library(data.table)
library(microbenchmark)
c.Joseph <- cmpfun(Joseph)
c.Henrik <- cmpfun(Henrik)
c.Gregor <- cmpfun(Gregor)
microbenchmark(c.Joseph(datBig), c.Gregor(datBig), c.Henrik(datBig), David(datBig), times = 10)
Unit: milliseconds
expr min lq mean median uq max neval cld
David(datBig) 2.20602 2.617742 4.35927 2.788686 3.13630 114.0674 10 a
c.Joseph(datBig) 61.91015 62.62090 95.44083 64.43548 93.20945 225.4576 10 b
c.Gregor(datBig) 59.25738 63.32861 126.29857 72.65927 214.35961 229.5022 10 b
c.Henrik(datBig) 1511.82449 1678.65330 1727.14751 1730.24842 1816.42601 1871.4476 10 c
As #Gregor pointed out, the goal is to find the x-distance between each occurrence of zero. This can be seen visually by plotting (again, as pointed out by #Gregor (many kudos btw)). For example, if we plot the first 20 values of datBig, we obtain:
From this, we can see that the x-distances such that the graph is either positive or negative (i.e. not zero (this happens when there are repeats of zeros)) are approximately:
2.0, 1.25, 0.5, 0.75, 2.0, 1.0, 0.75, 0.5
t1 <- c.Joseph(datBig)
t2 <- c.Gregor(datBig)
t3 <- c.Henrik(datBig)
t4 <- David(datBig)
## Correct values according to the plot above (mismatching values in the other results are flagged below)
## 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 2.00 2.00 2.00 0.00 0.00 0.00 1.00 0.00 0.75 0.50
## all correct
t1$Length[1:20]
[1] 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 2.00 2.00 2.00 0.00 0.00 0.00 1.00 0.00 0.75 0.50
## mostly correct (positions 11-13, 17 and 20 differ)
t2$Length[1:20]
 [1] 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 0.75 0.75 0.75 0.00 0.00 0.00 0.50 0.00 0.75 0.25
## least correct (13 of the 20 values differ)
t3$Length[1:20]
 [1] 2.00 2.00 2.00 0.50 1.00 1.25 0.75 1.25 0.00 1.75 1.75 0.00 1.50 1.50 0.00 0.00 1.25 1.25 1.25 1.25
## all correct
t4$Length[1:20]
[1] 2.00 2.00 2.00 0.00 1.25 1.25 0.50 0.75 0.00 0.00 2.00 2.00 2.00 0.00 0.00 0.00 1.00 0.00 0.75 0.50
# agreement with David's solution
all.equal(t4$Length, t1$Length)
[1] TRUE
Well, it seems the Rcpp solution provided by David is not only accurate but blazing fast.
