I'm trying to use optimx for a constrained nonlinear problem, but I just can't find an example online that I can adapt (I'm not an R programmer). I found that I should be using the call below to test a few algorithms:
optimx(par, fn, lower=low, upper=up, method=c("CG", "L-BFGS-B", "spg", "nlm"))
I understand par is just an example of a feasible solution. So, if I have two variables and (0,3) is feasible I can just do par <- c(0,3). If I want to minimise
2x+3y
subject to
2x^2 + 3y^2 <= 100
x<=3
-x<=0
-y<=-3
I guess I can set fn like
fn <- function(x) { return(2*x[1] + 3*x[2]) }
but how do I set lower and upper for my constraints?
Many thanks!
1) We can incorporate the constraints within the objective function by returning a large number if any constraint is violated.
For most methods (but not Nelder-Mead) the requirement is that the objective function be continuous and differentiable, and a starting value in the interior of the feasible region, not on the boundary, is required. These requirements are not satisfied for f below, but we will try it anyway.
library(optimx)
f <- function(z, x = z[1], y = z[2]) {
  ## return the objective where all constraints hold, else a large penalty
  if (2*x^2 + 3*y^2 <= 100 && x <= 3 && -x <= 0 && -y <= -3) 2*x + 3*y else 1e10
}
optimx(c(0, 3), f, method = c("Nelder", "CG", "L-BFGS-B", "spg", "nlm"))
## p1 p2 value fevals gevals niter convcode kkt1 kkt2 xtime
## Nelder-Mead 0 3 9 187 NA NA 0 FALSE FALSE 0.00
## CG 0 3 9 41 1 NA 0 FALSE FALSE 0.00
## L-BFGS-B 0 3 9 21 21 NA 52 FALSE FALSE 0.00
## spg 0 3 9 1077 NA 1 0 FALSE FALSE 0.05
## nlm 0 3 9 NA NA 1 0 FALSE FALSE 0.00
1a) This also works with optim, where Nelder-Mead is the default (or you could try constrOptim, which explicitly supports linear inequality constraints; see the sketch after the output below).
optim(c(0, 3), f)
## $par
## [1] 0 3
##
## $value
## [1] 9
##
## $counts
## function gradient
## 187 NA
## $convergence
## [1] 0
##
## $message
## NULL
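For completeness, a minimal constrOptim sketch for the three linear constraints (dropping the quadratic constraint, which 2) below shows is not active): constrOptim expects constraints in the form ui %*% par - ci >= 0 and needs a strictly interior starting value.
ui <- rbind(c(-1, 0),   # -x >= -3, i.e. x <= 3
            c( 1, 0),   #  x >= 0
            c( 0, 1))   #  y >= 3
ci <- c(-3, 0, 3)
## c(1, 4) satisfies all three constraints strictly
constrOptim(c(1, 4), function(z) 2*z[1] + 3*z[2], grad = NULL,
            ui = ui, ci = ci)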
2) Above we notice that the 2x^2 + 3y^2 <= 100 constraint is not active, so we can drop it. Since the objective function is increasing in both x and y independently, it is obvious that we want to set both of them to their lower bounds, so c(0, 3) is the answer.
If we want to use optimx anyway, we just use the upper= and lower= arguments for those methods that support them.
f2 <- function(z, x = z[1], y = z[2]) 2*x+3*y
optimx(c(0, 3), f2, lower = c(0, 3), upper = c(3, Inf),
       method = c("L-BFGS-B", "spg", "nlminb"))
## p1 p2 value fevals gevals niter convcode kkt1 kkt2 xtime
## L-BFGS-B 0 3 9 1 1 NA 0 FALSE NA 0.00
## spg 0 3 9 1 NA 0 0 FALSE NA 0.01
## nlminb 0 3 9 1 2 1 0 FALSE NA 0.00
## Warning message:
## In BB::spg(par = par, fn = ufn, gr = ugr, lower = lower, upper = upper, :
## convergence tolerance satisified at intial parameter values.
I have a multinomial logit model with two individual-specific variables (first and age).
I would like to conduct the hmftest to check if the IIA holds.
My dataset looks like this:
head(df)
mode choice first age
1 both 1 0 24
2 pre 1 1 23
3 both 1 2 53
4 post 1 3 43
5 no 1 1 55
6 both 1 2 63
I reshaped it for mlogit to:
mode choice first age idx
1 TRUE 1 0 24 1:both
2 FALSE 1 0 24 1:no
3 FALSE 1 0 24 1:post
4 FALSE 1 0 24 1:pre
5 FALSE 1 1 23 2:both
6 FALSE 1 1 23 2:no
7 FALSE 1 1 23 2:post
8 TRUE 1 1 23 2:pre
9 TRUE 1 2 53 3:both
10 FALSE 1 2 53 3:no
~~~ indexes ~~~~
id1 id2
1 1 both
2 1 no
3 1 post
4 1 pre
5 2 both
6 2 no
7 2 post
8 2 pre
9 3 both
10 3 no
indexes: 1, 2
My original (full) model runs as follows:
full <- mlogit(mode ~ 0 | first + age, data = df_mlogit, reflevel = "no")
leading to the following result:
Call:
mlogit(formula = mode ~ 0 | first + age, data = df_mlogit, reflevel = "no",
method = "nr")
Frequencies of alternatives:choice
no both post pre
0.2 0.4 0.2 0.2
nr method
18 iterations, 0h:0m:0s
g'(-H)^-1g = 8.11E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both 2.0077e+01 1.0441e+04 0.0019 0.9985
(Intercept):post -4.1283e-01 1.4771e+04 0.0000 1.0000
(Intercept):pre 5.3346e-01 1.4690e+04 0.0000 1.0000
first1:both -4.0237e+01 1.1059e+04 -0.0036 0.9971
first1:post -8.9168e-01 1.4771e+04 -0.0001 1.0000
first1:pre -6.6805e-01 1.4690e+04 0.0000 1.0000
first2:both -1.9674e+01 1.0441e+04 -0.0019 0.9985
first2:post -1.8975e+01 1.5683e+04 -0.0012 0.9990
first2:pre -1.8889e+01 1.5601e+04 -0.0012 0.9990
first3:both -2.1185e+01 1.1896e+04 -0.0018 0.9986
first3:post 1.9200e+01 1.5315e+04 0.0013 0.9990
first3:pre 1.9218e+01 1.5237e+04 0.0013 0.9990
age:both 2.1898e-02 2.9396e-02 0.7449 0.4563
age:post 9.3377e-03 2.3157e-02 0.4032 0.6868
age:pre -1.2338e-02 2.2812e-02 -0.5408 0.5886
Log-Likelihood: -61.044
McFadden R^2: 0.54178
Likelihood ratio test : chisq = 144.35 (p.value = < 2.22e-16)
To test for IIA, I exclude one alternative from the model (here "pre") and run the model as follows:
part <- mlogit(mode ~ 0 | first + age, data = df_mlogit, reflevel = "no",
alt.subset = c("no", "post", "both"))
leading to
Call:
mlogit(formula = mode ~ 0 | first + age, data = df_mlogit, alt.subset = c("no",
"post", "both"), reflevel = "no", method = "nr")
Frequencies of alternatives:choice
no both post
0.25 0.50 0.25
nr method
18 iterations, 0h:0m:0s
g'(-H)^-1g = 6.88E-07
gradient close to zero
Coefficients :
Estimate Std. Error z-value Pr(>|z|)
(Intercept):both 1.9136e+01 6.5223e+03 0.0029 0.9977
(Intercept):post -9.2040e-01 9.2734e+03 -0.0001 0.9999
first1:both -3.9410e+01 7.5835e+03 -0.0052 0.9959
first1:post -9.3119e-01 9.2734e+03 -0.0001 0.9999
first2:both -1.8733e+01 6.5223e+03 -0.0029 0.9977
first2:post -1.8094e+01 9.8569e+03 -0.0018 0.9985
first3:both -2.0191e+01 1.1049e+04 -0.0018 0.9985
first3:post 2.0119e+01 1.1188e+04 0.0018 0.9986
age:both 2.1898e-02 2.9396e-02 0.7449 0.4563
age:post 1.9879e-02 2.7872e-02 0.7132 0.4757
Log-Likelihood: -27.325
McFadden R^2: 0.67149
Likelihood ratio test : chisq = 111.71 (p.value = < 2.22e-16)
However, when I want to conduct the hmftest, the following error occurs:
> hmftest(full, part)
Error in solve.default(diff.var) :
system is computationally singular: reciprocal condition number = 4.34252e-21
Does anyone have an idea where the problem might be?
I believe the issue here is that hmftest checks whether the probability ratio of two alternatives depends only on the characteristics of those alternatives.
Since there are only individual-specific variables here, the test won't work in this case.
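For contrast, here is a minimal sketch of a case where hmftest does run, using the alternative-specific covariates price and catch from the Fishing data shipped with mlogit (the data-preparation call follows the package's documented example; exact output will depend on your mlogit version):
library(mlogit)
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, shape = "wide", varying = 2:9, choice = "mode")
full2 <- mlogit(mode ~ price + catch, data = Fish)
part2 <- mlogit(mode ~ price + catch, data = Fish,
                alt.subset = c("beach", "pier", "boat"))
hmftest(full2, part2)   # Hausman-McFadden test of IIA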
I read that the R flexsurv package can also be used for modeling time-dependent covariates, according to Christopher Jackson (2016), "flexsurv: A Platform for Parametric Survival Modeling in R", Journal of Statistical Software, 70(8).
However, I was not able to figure out how, even after several adjustments and searches in online forums.
Before turning to the estimation of time-dependent covariates I tried to create a simple model with only time-independent covariates to test whether I specified the Surv object correctly. Here is a small example.
library(splitstackshape)
library(flexsurv)
## create sample data
n <- 50
set.seed(2)
t <- rpois(n, 15) + 1   # survival times
x <- rnorm(n, t, 5)     # a covariate correlated with t
df <- data.frame(t, x)
df$id <- 1:n
df$rep <- df$t - 1      # number of one-unit intervals per subject
Which looks like this:
t x id rep
1 12 17.696149 1 11
2 12 20.358094 2 11
3 11 2.058789 3 10
4 16 26.156213 4 15
5 13 9.484278 5 12
6 15 15.790824 6 14
...
And the long data:
long.df <- expandRows(df, "rep")   # one row per one-unit interval
## build interval start times 1:(t-1) for each subject
rep.vec <- c()
for (i in 1:n) {
  rep.vec <- c(rep.vec, 1:(df[i, "t"] - 1))
}
long.df$start <- rep.vec
long.df$stop <- rep.vec + 1
## event indicator: 1 only in the interval that ends at the event time t
long.df$censrec <- ifelse(long.df$stop == long.df$t, 1, 0)
Which looks like this:
t x id start stop censrec
1 12 17.69615 1 1 2 0
1.1 12 17.69615 1 2 3 0
1.2 12 17.69615 1 3 4 0
1.3 12 17.69615 1 4 5 0
1.4 12 17.69615 1 5 6 0
1.5 12 17.69615 1 6 7 0
1.6 12 17.69615 1 7 8 0
1.7 12 17.69615 1 8 9 0
1.8 12 17.69615 1 9 10 0
1.9 12 17.69615 1 10 11 0
1.10 12 17.69615 1 11 12 1
2 12 20.35809 2 1 2 0
...
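As an aside, the start-time loop above can be replaced by a single vectorized call; a small equivalent sketch (sequence() concatenates seq_len() over each element of its argument):
long.df$start <- sequence(df$t - 1)   # 1:(t-1) for each subject, in order
long.df$stop  <- long.df$start + 1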
Now I can estimate a simple Cox model to see whether it works:
coxph(Surv(t)~x,data=df)
This yields:
coef exp(coef) se(coef) z p
x -0.0588 0.9429 0.0260 -2.26 0.024
And in the long format:
coxph(Surv(start,stop,censrec)~x,data=long.df)
I get:
coef exp(coef) se(coef) z p
x -0.0588 0.9429 0.0260 -2.26 0.024
Taken together I conclude that my transformation into the long format was correct. Now, turning to the flexsurv framework:
flexsurvreg(Surv(time=t)~x,data=df, dist="weibull")
yields:
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 5.00086 4.05569 6.16631 0.53452 NA NA NA
scale NA 13.17215 11.27876 15.38338 1.04293 NA NA NA
x 15.13380 0.01522 0.00567 0.02477 0.00487 1.01534 1.00569 1.02508
But
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull")
causes an error:
Error in flexsurvreg(Surv(start, stop, censrec) ~ x, data = long.df, dist = "weibull") :
Initial value for parameter 1 out of range
Would anyone happen to know the correct syntax for the latter Surv object? If you use the correct syntax, do you get the same estimates?
Thank you very much,
best,
David
===============
EDIT AFTER FEEDBACK FROM 42
===============
library(splitstackshape)
library(flexsurv)
x<-c(8.136527, 7.626712, 9.809122, 12.125973, 12.031536, 11.238394, 4.208863, 8.809854, 9.723636)
t<-c(2, 3, 13, 5, 7, 37 ,37, 9, 4)
df <- data.frame(t,x)
#transform into long format for time-dependent covariates
df$id <- 1:length(df$t)
df$rep <- df$t-1
long.df <- expandRows(df, "rep")
rep.vec<-c()
for(i in 1:length(df$t)){
rep.vec <- c(rep.vec,1:(df[i,"t"]-1))
}
long.df$start <- rep.vec
long.df$stop <- rep.vec +1
long.df$censrec <- 0
long.df$censrec<-ifelse(long.df$stop==long.df$t,1,long.df$censrec)
coxph(Surv(t)~x,data=df)
coxph(Surv(start,stop,censrec)~x,data=long.df)
flexsurvreg(Surv(time=t)~x,data=df, dist="weibull")
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull",inits=c(shape=.1, scale=1))
Which yields the same estimates for both coxph models but
Call:
flexsurvreg(formula = Surv(time = t) ~ x, data = df, dist = "weibull")
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 1.0783 0.6608 1.7594 0.2694 NA NA NA
scale NA 27.7731 3.5548 216.9901 29.1309 NA NA NA
x 9.3012 -0.0813 -0.2922 0.1295 0.1076 0.9219 0.7466 1.1383
N = 9, Events: 9, Censored: 0
Total time at risk: 117
Log-likelihood = -31.77307, df = 3
AIC = 69.54614
and
Call:
flexsurvreg(formula = Surv(start, stop, censrec) ~ x, data = long.df,
dist = "weibull", inits = c(shape = 0.1, scale = 1))
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 0.8660 0.4054 1.8498 0.3353 NA NA NA
scale NA 24.0596 1.7628 328.3853 32.0840 NA NA NA
x 8.4958 -0.0912 -0.3563 0.1739 0.1353 0.9128 0.7003 1.1899
N = 108, Events: 9, Censored: 99
Total time at risk: 108
Log-likelihood = -30.97986, df = 3
AIC = 67.95973
Reading the error message:
Error in flexsurvreg(Surv(start, stop, censrec) ~ x, data = long.df, dist = "weibull", :
initial values must be a numeric vector
And then reading the help page, ?flexsurvreg, it seemed worth trying to set inits to a named numeric vector:
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull", inits=c(shape=.1, scale=1))
Call:
flexsurvreg(formula = Surv(start, stop, censrec) ~ x, data = long.df,
dist = "weibull", inits = c(shape = 0.1, scale = 1))
Estimates:
data mean est L95% U95% se exp(est) L95% U95%
shape NA 5.00082 4.05560 6.16633 0.53454 NA NA NA
scale NA 13.17213 11.27871 15.38341 1.04294 NA NA NA
x 15.66145 0.01522 0.00567 0.02477 0.00487 1.01534 1.00569 1.02508
N = 715, Events: 50, Censored: 665
Total time at risk: 715
Log-likelihood = -131.5721, df = 3
AIC = 269.1443
Extremely similar results. My guess was basically a stab in the dark, so I have no guidance on how to make a choice if this had not succeeded other than to "expand the search."
I just want to mention that in flexsurv v1.1.1, running this code:
flexsurvreg(Surv(start,stop,censrec) ~ x ,data=long.df, dist="weibull")
doesn't return any errors. It also gives the same estimates as the non-time-varying command
flexsurvreg(Surv(time=t)~x,data=df, dist="weibull")
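Since the original goal was time-dependent covariates: once the data are in this counting-process form, a time-varying covariate is just a column that changes across rows within an id. A sketch with a made-up covariate xt (xt and its construction are purely illustrative assumptions):
long.df$xt <- long.df$x + 0.1 * long.df$start   # hypothetical time-varying covariate
flexsurvreg(Surv(start, stop, censrec) ~ xt, data = long.df, dist = "weibull")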
What is the base tree classifier for the gbm package, rpart or ctree?
Suppose we want to do a 2-class classification with the following data:
D = iris[iris$Species!='setosa', ]
D$Species = as.factor(as.character(D$Species))
D$label = as.numeric(D$Species) - 1
D$Species = NULL
head(D)
Sepal.Length Sepal.Width Petal.Length Petal.Width label
51 7.0 3.2 4.7 1.4 0
52 6.4 3.2 4.5 1.5 0
53 6.9 3.1 4.9 1.5 0
54 5.5 2.3 4.0 1.3 0
55 6.5 2.8 4.6 1.5 0
56 5.7 2.8 4.5 1.3 0
The following is the result from rpart with max depth fixed at 1:
library(rpart)
fit1 = rpart(data = D, label ~ ., control = rpart.control(maxdepth = 1), method = 'class')
fit1
n= 100
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 100 50 0 (0.50000000 0.50000000)
2) Petal.Width< 1.75 54 5 0 (0.90740741 0.09259259) *
3) Petal.Width>=1.75 46 1 1 (0.02173913 0.97826087) *
The splitting variable is Petal.Width at 1.75.
The following is the tree from gbm with shrinkage set to 1:
library(gbm)
fit = gbm(data = D, label ~ ., shrinkage = 1, n.trees = 1, train.fraction = 1)
pretty.gbm.tree(fit, i.tree = 1)
SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight Prediction
0 2 5.050000 1 2 3 9.610323 50 -0.240000
1 -1 -1.612903 -1 -1 -1 0.000000 31 -1.612903
2 -1 2.000000 -1 -1 -1 0.000000 19 2.000000
3 -1 -0.240000 -1 -1 -1 0.000000 50 -0.240000
The tree splits the root node on the 3rd column (SplitVar is indexed from 0, so 2 + 1 = 3), Petal.Length, at value 5.05.
If I remove the option train.fraction = 1, there is some randomness in the fitted result. Is it automatically doing some bagging or CV?
So the big question is: how can we reproduce the very first tree in the gbm result using rpart, ctree or others? And what shrinkage value should we use to get that?
Update:
The results are consistent if bag.fraction = 1 is added.
But why doesn't changing shrinkage affect the result? Does it only affect the result from the 2nd tree onward?
And how can we generate the 2nd tree in the gbm results using rpart?
fit = gbm(data=D, label ~., shrinkage = 1, n.trees = 2, train.fraction = 1,bag.fraction = 1)
pretty.gbm.tree(fit, i.tree = 1)
SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight Prediction
0 3 1.750000 1 2 3 19.4847 100 0.000000
1 -1 -1.629630 -1 -1 -1 0.0000 54 -1.629630
2 -1 1.913043 -1 -1 -1 0.0000 46 1.913043
3 -1 0.000000 -1 -1 -1 0.0000 100 0.000000
pretty.gbm.tree(fit, i.tree = 2)
SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight Prediction
0 2 4.9500000 1 2 3 2.368983 100 0.1792148
1 -1 -0.9785858 -1 -1 -1 0.000000 54 -0.9785858
2 -1 1.5383721 -1 -1 -1 0.000000 46 1.5383721
3 -1 0.1792148 -1 -1 -1 0.000000 100 0.1792148
If the target is continuous, I can do it by following the ideas here http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/
But how can we do it for categorical target variables?
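One hedged way to approach it: gbm grows its trees with an internal CART-style implementation (neither rpart nor ctree), and with a 0/1 response it assumes distribution = "bernoulli". Its first tree is then a least-squares fit to the gradient z = y - p, with terminal-node values given by a Newton step sum(z) / sum(p*(1-p)). A sketch under those assumptions:
library(rpart)
p <- rep(mean(D$label), nrow(D))   # initial fit F0 = log(p/(1-p)) = 0, so p = 0.5
D$z <- D$label - p                 # bernoulli gradient (pseudo-response)
## a least-squares stump on z should recover gbm's first split
## (Petal.Width at 1.75 when bag.fraction = 1)
rpart(z ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
      data = D, method = "anova", control = rpart.control(maxdepth = 1))
## Newton-step leaf values: (5 - 27)/(54*0.25) = -1.6296 and
## (45 - 23)/(46*0.25) = 1.9130, matching pretty.gbm.tree above.
## Shrinkage only scales these values when the fit is updated, which is
## why it leaves tree 1's split unchanged and only shows up from tree 2 on.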
Can anyone tell me if it is possible to incorporate:
a) an interaction term
b) a random effect
in a Tobit regression model in R?
For the interaction term I have been working on the following script, but that doesn't work.
fit <- vglm(GAG_p_DNA~factor(condition)+factor(time)+factor(condition):factor(time),
tobit(Lower = 0))
Error in if ((temp <- sum(wz[, 1:M, drop = FALSE] < wzepsilon))) warning(paste(temp, :
argument is not interpretable as logical
I have also tried this with dummy variables, created in the following way:
time.ch<- C(time, helmert,2)
print(attributes(time.ch))
condition.ch<-C(condition, helmert, 3)
print(attributes(condition.ch))
but I get the same error.
Part of the dataset (GAG_p_DNA values of zero are left censored). (Warning to those copying this: the OP used tabs as separators.)
Donor Time Condition GAG_p_DNA cens_GAG_p_DNA
1 1 6 0.97 1
1 1 10 0.93 1
1 7 2 16.65 1
1 7 6 0.94 1
1 7 10 1.86 1
1 28 2 21.66 1
1 28 6 0.07 1
1 28 10 3.48 1
2 1 1 1.16 1
2 1 2 2.25 1
2 1 6 2.41 1
2 1 10 1.88 1
2 7 2 13.19 1
2 7 10 2.54 1
2 28 2 23.93 1
2 28 6 0 0
2 28 10 15.17 1
I most likely need to use a Tobit regression model, as it seems that a Cox model with left-censored data is not supported in R...
fit <- coxph(Surv(GAG_p_DNA, cens_GAG_p_DNA, type="left") ~ factor(condition) + factor(Time))
Error in coxph(Surv(GAG_p_DNA, cens_GAG_p_DNA, type = "left") ~ factor(condition) + :
  Cox model doesn't support "left" survival data
Try this:
survreg(Surv( GAG_p_DNA, cens_GAG_p_DNA, type='left') ~
factor(Time)*factor(Condition), data=sdat, dist='gaussian')
(Recommended by Therneau: http://markmail.org/search/?q=list%3Aorg.r-project.r-help+therneau+left+censor+tobit#query:list%3Aorg.r-project.r-help%20therneau%20left%20censor%20tobit+page:1+mid:fnczjvrnjlx5jsp5+state:results )
--- earlier effort ---
With that tiny dataset (where I have corrected the use of tabs as separators) you won't get much. I corrected two errors (the spelling of "Condition", and using 0 for left censoring where it should be 2), and it runs without error:
sdat$cens_GAG_p_DNA[sdat$cens_GAG_p_DNA==0] <- 2
fit <- survfit(Surv(GAG_p_DNA, cens_GAG_p_DNA, type="left") ~
factor(Condition) + factor(Time), data=sdat)
Warning messages:
1: In min(jtimes) : no non-missing arguments to min; returning Inf
2: In min(jtimes) : no non-missing arguments to min; returning Inf
3: In min(jtimes) : no non-missing arguments to min; returning Inf
4: In min(jtimes) : no non-missing arguments to min; returning Inf
5: In min(jtimes) : no non-missing arguments to min; returning Inf
6: In min(jtimes) : no non-missing arguments to min; returning Inf
7: In min(jtimes) : no non-missing arguments to min; returning Inf
8: In min(jtimes) : no non-missing arguments to min; returning Inf
9: In min(jtimes) : no non-missing arguments to min; returning Inf
> fit
Call: survfit(formula = Surv(GAG_p_DNA, cens_GAG_p_DNA, type = "left") ~
factor(Condition) + factor(Time), data = sdat)
records n.max n.start events median
factor(Condition)=1, factor(Time)=1 1 2 2 0 1.16
factor(Condition)=2, factor(Time)=1 1 2 2 0 2.25
factor(Condition)=2, factor(Time)=7 2 3 3 0 14.92
factor(Condition)=2, factor(Time)=28 2 3 3 0 22.80
factor(Condition)=6, factor(Time)=1 2 3 3 0 1.69
factor(Condition)=6, factor(Time)=7 1 2 2 0 0.94
factor(Condition)=6, factor(Time)=28 2 2 2 2 0.00
factor(Condition)=10, factor(Time)=1 2 3 3 0 1.41
factor(Condition)=10, factor(Time)=7 2 3 3 0 2.20
factor(Condition)=10, factor(Time)=28 2 3 3 0 9.32
0.95LCL 0.95UCL
factor(Condition)=1, factor(Time)=1 NA NA
factor(Condition)=2, factor(Time)=1 NA NA
factor(Condition)=2, factor(Time)=7 13.19 NA
factor(Condition)=2, factor(Time)=28 21.66 NA
factor(Condition)=6, factor(Time)=1 0.97 NA
factor(Condition)=6, factor(Time)=7 NA NA
factor(Condition)=6, factor(Time)=28 0.00 NA
factor(Condition)=10, factor(Time)=1 0.93 NA
factor(Condition)=10, factor(Time)=7 1.86 NA
factor(Condition)=10, factor(Time)=28 3.48 NA
The other aspect I would also call an error is not supplying a data argument to regression functions. Relying on "attached" data frames with any regression function, but especially with the 'survival' package, will often cause strange errors.
I did find that putting in an interaction via the formula method generated this error:
Error in survfit.formula(Surv(GAG_p_DNA, cens_GAG_p_DNA, type = "left") ~ :
Interaction terms are not valid for this function
And I also found that coxme::coxme, which I had speculated might give you access to mixed effects, did not handle left censoring.
fit <- coxme(Surv(GAG_p_DNA, cens_GAG_p_DNA, type="left")~factor(Condition)*factor(Time), data=sdat)
Error in coxme(Surv(GAG_p_DNA, cens_GAG_p_DNA, type = "left") ~ factor(Condition) * :
Cox model doesn't support 'left' survival data
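As for b), one avenue that might be worth trying for a random effect is a frailty term in survreg (a sketch only, untested on this data; it assumes the sdat data frame above and that survreg's frailty support behaves sensibly with left censoring):
## gaussian frailty on Donor as a rough stand-in for a random intercept
survreg(Surv(GAG_p_DNA, cens_GAG_p_DNA, type = "left") ~
          factor(Time) * factor(Condition) + frailty(Donor, distribution = "gaussian"),
        data = sdat, dist = "gaussian")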