Compute the number of leaves in rpart - R

I use rpart to fit a regression tree:
library(rpart)
library(MASS)
N = 1000
epsilon = rnorm(N, 0, 0.01)
x1 = rnorm(N, 0, sd = 1)
x2 = rnorm(N, 0, sd = 1)
eta_x = 1/2*x1 + x2
Kappa_x = 1/2*x1
w = rbinom(N, 1, 0.5)
treatment = w
# outcome = baseline eta + treatment effect + noise
makeY = function(eta, Kappa){
  Y = eta + 1/2*(2*w - 1)*Kappa + epsilon
  Y
}
Y1 = makeY(eta_x, Kappa_x)
fit = rpart(Y1 ~ x1 + x2)
plot(fit)
text(fit)
I want a function that tells me how many leaves (terminal nodes) there are in this tree, e.g. that it has 12 leaves.

The fit object has all the information that you need. You can examine it using str(fit).
Two ways to find the number of leaves are:
sum(fit$frame$ncompete == 0)
[1] 11
and
sum(fit$frame$var == "<leaf>")
[1] 11
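Since the question asks for a function, a minimal wrapper is sketched below (the name n_leaves is my own; it just packages the second expression above):
# Count the terminal nodes ("leaves") of an rpart fit
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")
n_leaves(fit)
[1] 11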

Related

Using dede from the R deSolve/FME packages to fit data to a compartment model

I'm trying to fit the tetracycline data set from Bates & Watts with a compartment model that is described by a system of first-order differential equations. The system has an analytic solution, but I want to use the dede function to estimate the parameters numerically.
I can get parameter estimates which are close to the ones published in Bates and Watts but I'm wondering if I have coded the problem correctly. Specifically, since Bates & Watts account for dead time in their solution, I'm concerned about whether I have coded the use of lagvalue() in the function called DiffEqns correctly.
My programming question relates to coding of the derivatives with lag time. They are currently coded as:
dy1 <- -theta1*y1lag
dy2 <- theta1*y1lag - theta2*y2lag
However, I wonder if the derivatives should be coded instead as:
dy1 <- -theta1*y1lag*y[1]
dy2 <- theta1*y1lag*y[1] - theta2*y2lag*y[2]
# Analyze the tetracycline data set as a two-compartment model
# (see Bates & Watts, "Nonlinear Regression Analysis and Its Applications")
## Note: the differential equations for the compartment model are:
## dy1/dt = -theta1*y1
## dy2/dt = theta1*y1 - theta2*y2
## (see p. 169 in Bates & Watts)
# Load packages
library(FME)
# Create the tetracycline dataset (see p. 281 in Bates & Watts)
tetra <- structure(list(time = c(1, 2, 3, 4, 6, 8, 10, 12, 16),
                        conc = c(0.7, 1.2, 1.4, 1.4, 1.1, 0.8, 0.6, 0.5, 0.3)),
                   row.names = c(NA, 9L), class = "data.frame")
# Observe that: A) "conc" = data for y2; B) there is no data for y1; C) data start at time = 1 instead of time = 0
# Create a differential equation model with dead time
DiffEqns <- function(t, y, parms) {
  theta1 <- parms[1] # rate constant for y1
  theta2 <- parms[2] # rate constant for y2
  theta3 <- parms[3] # amount of y1 at time = 0
  theta4 <- parms[4] # parameter that accounts for dead time
  y1lag <- ifelse(t - theta4 < 0, 0, lagvalue(t - theta4, 1))
  y2lag <- ifelse(t - theta4 < 0, 0, lagvalue(t - theta4, 2))
  dy1 <- -theta1*y1lag
  dy2 <- theta1*y1lag - theta2*y2lag
  return(list(c(dy1, dy2), y1lag = y1lag, y2lag = y2lag))
}
# Find a numerical solution for the system of delay differential equations using dede() from deSolve
time <- seq(from = 0, to = 16, by = 0.1)
Cost <- function(P) {
  theta1 <- P[1]
  theta2 <- P[2]
  theta3 <- P[3]
  theta4 <- P[4]
  theta <- c(theta1, theta2, theta3, theta4)
  yinit <- c(y1 = theta3, conc = 0)
  out <- dede(y = yinit, times = time, func = DiffEqns, parms = theta)
  modCost(model = out, obs = tetra)
}
theta <- c(theta1 = 0.1, theta2 = 0.2, theta3 = 5, theta4 = 0.2) # starting values for the parameters
yinit <- c(y1 = theta[3], conc = 0)
CompModFit2 <- modFit(f = Cost, p = theta, lower = c(0,0,0,0))
FMEtheta <- coef(CompModFit2)
# Compare data to numerical model solution using parameters from modFit
dedeFitted <- dede(times = time,y = c(y1 = FMEtheta[3], conc = 0), func = DiffEqns, parms = FMEtheta)
plot(dedeFitted, obs=tetra)
# Parameters from FME are:
# theta1 theta2 theta3 theta4
#0.1193617 0.6974401 10.7188251 0.2206997
# Compare FME parameters to the parameter estimates published in Bates & Watts:
# theta1 theta2 theta3 theta4
# 0.1488 0.7158 10.10 0.4123
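A possible cross-check (my own suggestion, not part of the original post): if I read the Bates & Watts dead-time formulation correctly, it is the ordinary compartment model simply shifted to start at t = theta4, which has a closed-form solution for y2 when theta1 != theta2. Comparing the dede output against that closed form shows whether the lagvalue() formulation is equivalent to the dead-time model or something different:
# Closed-form dead-time solution of dy1/dt = -theta1*y1, dy2/dt = theta1*y1 - theta2*y2
# with y1(theta4) = theta3, y2(theta4) = 0 (assumes theta1 != theta2)
y2_analytic <- function(t, theta1, theta2, theta3, theta4) {
  tt <- pmax(t - theta4, 0)
  theta3 * theta1 / (theta2 - theta1) * (exp(-theta1 * tt) - exp(-theta2 * tt))
}
y2_num <- dedeFitted[, "conc"]
y2_ana <- y2_analytic(dedeFitted[, "time"], FMEtheta[1], FMEtheta[2], FMEtheta[3], FMEtheta[4])
max(abs(y2_num - y2_ana))  # a large discrepancy would mean the DDE is not the shifted dead-time model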

"One after the other" realisation of discrete random variables

I'm stuck with the following problem:
There are n+1 given discrete random variables:
X taking values in {1,...,n} with P(X = i) = p_i
Y_i taking values in {1,...,n_i} with P(Y_i = j) = p_ij, for i = 1,...,n
We do the following:
1. We draw from X and the result determines which Y_i we choose for the next step: if X = a, we use Y_a.
2. We draw from this Y_a.
Now my questions about this:
1. How do I get the expected value and the variance of the whole process?
2. Can this "process" be defined by a single random variable?
3. Assume we only know the EV and Var of all Y_i, but not all (or even none) of the probabilities. Can we still calculate the EV and Var of the whole process?
4. If 2) can be done, how can this be done efficiently in R?
To give you an example of what I've tried:
X = {1,2} with P(X = 1) = 0.3 and P(X = 2) = 0.7
Y_1 = {2,3} with P(Y_1 = 2) = 0.5 and P(Y_1 = 3) = 0.5
Y_2 = {1,5,20} with P(Y_2 = 1) = 0.3, P(Y_2 = 5) = 0.6 and P(Y_2 = 20) = 0.1
I have tried to combine these into a single random variable Z, but I'm not sure whether it can be done this way:
Z = {2,3,1,5,20} with probabilities (0.5*0.3, 0.5*0.3, 0.3*0.7, 0.6*0.7, 0.1*0.7)
The weighted EV is correct, but the "weighted" Var is different, assuming it is even correct to use the formula for the variance of a linear combination of independent random variables. (Maybe just my formula for the combined Var is wrong.)
I used R and the package "discreteRV":
install.packages("discreteRV")
library(discreteRV)
#defining the RVs
Y_1 <- RV(outcomes = c(2, 3), probs = c(0.5, 0.5)) # occurs 30% of the time
Y_2 <- RV(outcomes = c(1, 5, 20), probs = c(0.3, 0.6, 0.1)) # occurs 70% of the time
Z <- RV(outcomes = c(2, 3, 1, 5, 20),
probs = c(0.5*0.3, 0.5*0.3, 0.3*0.7, 0.6*0.7, 0.1*0.7))
#calculating the EVs
E(Z)
E(Y_1)*0.3 + E(Y_2)*0.7
#calculating the VARs
V(Z)
V(Y_1)*(0.3)^2 + V(Y_2)*(0.7)^2
Actually Z has a larger sample space, made up of the outcomes of Y1 and Y2: it is a mixture of the two components, not a linear superposition. In other words, we should read Z as "Y1 with probability 0.3, Y2 with probability 0.7" rather than Z = 0.3*Y1 + 0.7*Y2.
Since we have
V(Z) = E(Z**2)-E(Z)**2
> E(Z**2) -E(Z)**2
[1] 20.7684
> V(Z)
[1] 20.7684
Expanding E(Z)**2 produces cross-product terms between the Y1 and Y2 parts, which is why V(Z) != V(Y_1)*(0.3)^2 + V(Y_2)*(0.7)^2.
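The correct combination is the law of total variance, V(Z) = E[V(Y_X)] + V(E[Y_X]). A short check with the objects defined above (a sketch; p, m, v and EZ are my own names) reproduces V(Z):
# Law of total variance for the two-stage draw
p <- c(0.3, 0.7)                  # P(X = 1), P(X = 2)
m <- c(E(Y_1), E(Y_2))            # conditional means
v <- c(V(Y_1), V(Y_2))            # conditional variances
EZ <- sum(p * m)                  # E(Z) = 4.46
sum(p * v) + sum(p * (m - EZ)^2)  # 20.7684, matching V(Z) above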

How do I generate a list of b1 coefficients from simulations of a simple linear regression model?

I have to simulate the linear model Y_i = B0 + B1*X_i + e_i 100 times with sample size 10. The parameters are given as e_i ~ N(0, 2^2), X_i ~ N(0, 1^2), B0 = 0.5, and B1 = 2. I need to extract the slope coefficient from each simulation. So far I have been able to get the coefficients of a single simulation, but when I try to use the R function coefficients() on multiple simulations at once I get NULL. Here is my code so far:
b1sims = function(nrep = 10, b0 = 0.5, b1 = 2, sigma = 2){
e<-rnorm(n, 0, 2)
x<-rnorm(n, 0, 1)
y<-0.5 + 2*x + e
n = 10
simdata = data.frame(x, y)
b1fit = lm(y~x, data = simdata)
b1fit
}
coefficients(replicate(100, b1sims()))
Once the function is fixed so that it actually runs (see below), this is what you need:
replicate(100, b1sims())[1,]
This will give you a list of the coefficient vectors. replicate(100, b1sims()) gives a matrix whose columns contain all the components of each fitted regression model.
# fix the function so it runs
b1sims = function(n = 10, b0 = 0.5, b1 = 2, sigma = 2){
  e <- rnorm(n, 0, 2)
  x <- rnorm(n, 0, 1)
  y <- 0.5 + 2*x + e
  simdata = data.frame(x, y)
  b1fit = lm(y ~ x, data = simdata)
  b1fit
}
#create 100 models and loop over resulting list
#to extract coefficients
sapply(replicate(100, b1sims(), simplify = FALSE), coef)[2, ]
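If the goal is just a numeric vector of slope estimates, another option (a sketch; b1only is my own name, and unlike the code above it actually uses the function arguments rather than hard-coded values) is to return only the slope from inside the function:
b1only <- function(n = 10, b0 = 0.5, b1 = 2, sigma = 2) {
  x <- rnorm(n, 0, 1)
  e <- rnorm(n, 0, sigma)
  y <- b0 + b1*x + e
  unname(coef(lm(y ~ x))[2])  # slope estimate only
}
b1_draws <- replicate(100, b1only())
hist(b1_draws)                # sampling distribution of the slope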

Confusion about 'standardize' option of glmnet package in R

I'm confused about the standardize option of the glmnet package in R. I get different coefficients when I standardize the covariate matrix myself and set standardize=FALSE than when I leave the covariates unstandardized and set standardize=TRUE, although I assumed they would be the same. The two cases are shown below as ridge.mod1 and ridge.mod2. I also created a model (ridge.mod3) that standardizes the outcome as well as the covariate matrix and uses standardize=FALSE, to check whether I need to standardize the outcome too in order to reproduce the coefficients of ridge.mod1.
set.seed(1)
y <- rnorm(30, 20, 10)
x1 <- rnorm(30, 5, 2)
x2 <- x1 + rnorm(30, 0, 5)
cor(x1,x2)
x <- as.matrix(cbind(x1,x2))
z1 <- scale(x1)
z2 <- scale(x2)
z <- as.matrix(cbind(z1,z2))
y.scale <- scale(y)
n <- 30
# Fixing foldid for proper comparison
foldid=sample(rep(seq(5),length=n))
table(foldid)
library(glmnet)
cv.ridge.mod1 <- cv.glmnet(x, y, alpha = 0, nfolds = 5, foldid=foldid, standardize = TRUE)
ridge.mod1 <- glmnet(x, y, alpha = 0, standardize = TRUE)
coef(ridge.mod1, s=cv.ridge.mod1$lambda.min)
> coef(ridge.mod1, s=cv.ridge.mod1$lambda.min)
3 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 2.082458e+01
x1 2.856136e-37
x2 4.334910e-38
cv.ridge.mod2 <- cv.glmnet(z, y, alpha = 0, nfolds = 5, foldid=foldid, standardize = FALSE)
ridge.mod2 <- glmnet(z, y, alpha = 0, standardize = FALSE)
coef(ridge.mod2, s=cv.ridge.mod2$lambda.min)
> coef(ridge.mod2, s=cv.ridge.mod2$lambda.min)
3 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 2.082458e+01
V1 4.391657e-37
V2 2.389751e-37
cv.ridge.mod3 <- cv.glmnet(z, y.scale, alpha = 0, nfolds = 5, foldid=foldid, standardize = FALSE)
ridge.mod3 <- glmnet(z, y.scale, alpha = 0, standardize = FALSE)
coef(ridge.mod3, s=cv.ridge.mod3$lambda.min)
> coef(ridge.mod3, s=cv.ridge.mod3$lambda.min)
3 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 1.023487e-16
V1 4.752255e-38
V2 2.585973e-38
Could anyone please tell me what's going on there and if (or how) I can get the same coefficients as in ridge.mod1 with prior standardization (in the data processing step) and then using standardize=FALSE?
Update: (what I tried based on the comments below)
So, I tried standardizing by sqrt(SS/n) instead of sqrt(SS/(n-1)), and I tried standardizing both y and x. Neither gave me coefficients that, after de-standardizing, equal the coefficients of model 1.
## Standardizing by sqrt(SS(X)/n) like glmnet, instead of sqrt(SS(X)/(n-1)) as done by the scale command
Xs <- apply(x, 2, function(m) (m - mean(m)) / sqrt(sum(m^2) / n))
Ys <- (y-mean(y)) / sqrt(sum(y^2) / n)
# Standardizing only X by sqrt(SS(X)/n)
cv.ridge.mod4 <- cv.glmnet(Xs, y, alpha = 0, nfolds = 5, foldid=foldid, standardize = FALSE)
ridge.mod4 <- glmnet(Xs, y, alpha = 0, standardize = FALSE)
coef(ridge.mod4, s=cv.ridge.mod4$lambda.min)
> coef(ridge.mod4, s=cv.ridge.mod4$lambda.min)[2]/sd(x1)
[1] 7.995171e-38
> coef(ridge.mod4, s=cv.ridge.mod4$lambda.min)[3]/sd(x2)
[1] 2.957854e-38
# Standardizing both Y and X by sqrt(SS/n), but neither is centered
cv.ridge.mod6 <- cv.glmnet(Xs.noncentered, Ys.noncentered, alpha = 0, nfolds = 5, foldid=foldid, standardize = FALSE)
ridge.mod6 <- glmnet(Xs.noncentered, Ys.noncentered, alpha = 0, standardize = FALSE)
coef(ridge.mod6, s=cv.ridge.mod6$lambda.min)
> coef(ridge.mod6, s=cv.ridge.mod6$lambda.min)[2] / (sqrt(sum(x1^2) / n))
[1] 1.019023e-39
> coef(ridge.mod6, s=cv.ridge.mod6$lambda.min)[3] / (sqrt(sum(x2^2) / n))
[1] 9.189263e-40
What is it that still is wrong there?
I tweaked your code so that I can work with a more sensible problem. To reproduce the coefficients when switching between standardize=TRUE and standardize=FALSE, you first need to standardize the variables with the 1/N variance estimator, which is what glmnet uses internally. For this example I also centered the variables to get rid of the intercept, and I focus only on the coefficients of the variables. After fitting, note that each standardized coefficient relates to the original-scale one through the factor sdN(y)/sdN(x_j), so you have to invert that scaling to get the de-standardized coefficients. I do that in the following code.
set.seed(1)
x1 <- rnorm(300, 5, 2)
x2 <- x1 + rnorm(300, 0, 5)
x3 <- rnorm(300, 6, 5)
e= rnorm(300, 0, 1)
y <- 0.3*x1+3.5*x2+x3+e
x <- as.matrix(cbind(x1,x2,x3))
sdN = function(x){
  sigma = sqrt((1/length(x)) * sum((x - mean(x))^2))
  return(sigma)
}
n=300
foldid=sample(rep(seq(5),length=n))
g1=(x1-mean(x1))/sdN(x1)
g2=(x2-mean(x2))/sdN(x2)
g3=(x3-mean(x3))/sdN(x3)
gy=(y-mean(y))/sdN(y)
equis <- as.matrix(cbind(g1,g2,g3))
library(glmnet)
cv.ridge.mod1 <- cv.glmnet(x, y, alpha = 0, nfolds = 5, foldid=foldid,standardize = TRUE)
coef(cv.ridge.mod1, s=cv.ridge.mod1$lambda.min)
cv.ridge.mod2 <- cv.glmnet(equis, gy, alpha = 0, nfolds = 5, foldid=foldid, standardize = FALSE, intercept=FALSE)
beta=coef(cv.ridge.mod2, s=cv.ridge.mod2$lambda.min)
beta[2]*sdN(y)/sdN(x1)
beta[3]*sdN(y)/sdN(x2)
beta[4]*sdN(y)/sdN(x3)
coef(cv.ridge.mod1, s=cv.ridge.mod1$lambda.min)
This yields the following results:
> beta[2]*sdN(y)/sdN(x1)
[1] 0.5984356
> beta[3]*sdN(y)/sdN(x2)
[1] 3.166033
> beta[4]*sdN(y)/sdN(x3)
[1] 0.9145646
>
> coef(cv.ridge.mod1, s=cv.ridge.mod1$lambda.min)
4 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 0.5951423
x1 0.5984356
x2 3.1660328
x3 0.9145646
As you can see, the coefficients agree to 4 decimal places. I hope this answers your question.
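If you also want the intercept back on the original scale (not shown above; this is a sketch following the same back-transformation logic, and b_destd is my own name), it follows from un-centering: intercept = mean(y) - sum(de-standardized coefficients * column means of x):
b_destd <- c(beta[2]*sdN(y)/sdN(x1),
             beta[3]*sdN(y)/sdN(x2),
             beta[4]*sdN(y)/sdN(x3))
mean(y) - sum(b_destd * c(mean(x1), mean(x2), mean(x3)))
# should be close to the (Intercept) of cv.ridge.mod1 (0.5951423 above)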

Why is the gradient of the first iteration step singular in nls with a bivariate normal?

I am trying to fit a non-linear regression model where the mean-function is the bivariate normal distribution. The parameter to specify is the correlation rho.
The problem: "gradient of first iteration step is singular". Why?
Here is a small example with simulated data.
library(mnormt) # provides pmnorm
# given values for independent variables
x1 <- c(rep(0.1,5), rep(0.2,5), rep(0.3,5), rep(0.4,5), rep(0.5,5))
x2 <- c(rep(c(0.1,0.2,0.3,0.4,0.5),5))
## 1 generate values for dependent variable (incl. error term)
# from bivariate normal distribution with assumed correlation rho=0.5
fun <- function(b) pmnorm(x = c(qnorm(x1[b]), qnorm(x2[b])),
mean = c(0, 0),
varcov = matrix(c(1, 0.5, 0.5, 1), nrow = 2))
set.seed(123)
y <- sapply(1:25, function(b) fun(b)) + runif(25)/1000
# put it in data frame
dat <- data.frame(y=y, x1=x1, x2=x2 )
# 2 : calculate non-linear regression from the generated data
# use rho=0.51 as starting value
fun <- function(x1, x2,rho) pmnorm(x = c(qnorm(x1), qnorm(x2)),
mean = c(0, 0),
varcov = matrix(c(1, rho, rho, 1), nrow = 2))
nls(formula= y ~ fun(x1, x2, rho), data= dat, start=list(rho=0.51),
lower=0, upper=1, trace=TRUE)
This yields an error message:
Error in nls(formula = y ~ fun(x1, x2, rho), data = dat, start = list(rho = 0.51), :
singular gradient
In addition: Warning message:
In nls(formula = y ~ fun(x1, x2, rho), data = dat, start = list(rho = 0.51), :
upper and lower bounds ignored unless algorithm = "port"
What I don't understand:
1. I have only one parameter (rho), so there is only one gradient component, which would have to be 0 for the matrix of gradients to be singular. Why should the gradient be 0?
2. The starting value cannot be the problem, as I know the true rho = 0.5, so the starting value 0.51 should be fine, shouldn't it?
3. The data cannot be completely linearly dependent, as I added an error term to y.
Perhaps "optim" does a better job than "nls":
library(mnormt)
# given values for independent variables
x1 <- c(rep(0.1,5), rep(0.2,5), rep(0.3,5), rep(0.4,5), rep(0.5,5))
x2 <- c(rep(c(0.1,0.2,0.3,0.4,0.5),5))
## 1 generate values for dependent variable (incl. error term)
# from bivariate normal distribution with assumed correlation rho=0.5
fun <- function(b) pmnorm(x = c(qnorm(x1[b]), qnorm(x2[b])),
mean = c(0, 0),
varcov = matrix(c(1, 0.5, 0.5, 1), nrow = 2))
set.seed(123)
y <- sapply(1:25, function(b) fun(b)) + runif(25)/1000
# put it in data frame
dat <- data.frame(y=y, x1=x1, x2=x2 )
# 2 : calculate non-linear regression from the generated data
# use rho=0.51 as starting value
fun <- function(x1, x2,rho) pmnorm(x = c(qnorm(x1), qnorm(x2)),
mean = c(0, 0),
varcov = matrix(c(1, rho, rho, 1), nrow = 2))
f <- function(rho) {
  sum(sapply(1:nrow(dat),
             function(i) (fun(dat[i, 2], dat[i, 3], rho) - dat[i, 1])^2))
}
optim(0.51, f, method="BFGS")
The result is not that bad:
> optim(0.51, f, method="BFGS")
$par
[1] 0.5043406
$value
[1] 3.479377e-06
$counts
function gradient
14 4
$convergence
[1] 0
$message
NULL
Maybe even a little bit better than 0.5:
> f(0.5043406)
[1] 3.479377e-06
> f(0.5)
[1] 1.103484e-05
>
Let's check another start value:
> optim(0.8, f, method="BFGS")
$par
[1] 0.5043407
$value
[1] 3.479377e-06
$counts
function gradient
28 6
$convergence
[1] 0
$message
NULL
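A possible follow-up (my own untested sketch, not part of the answer above): as written, fun is not vectorised over the observations, because c(qnorm(x1), qnorm(x2)) concatenates all 50 quantiles into one vector instead of building one 2-dimensional point per row, which may be related to the singular gradient. Vectorising fun with mapply and using algorithm = "port" (so that the bounds on rho are honoured) may allow nls to run:
fun_vec <- function(x1, x2, rho) {
  mapply(function(a, b) pmnorm(x = c(qnorm(a), qnorm(b)),
                               mean = c(0, 0),
                               varcov = matrix(c(1, rho, rho, 1), nrow = 2)),
         x1, x2)
}
nls(y ~ fun_vec(x1, x2, rho), data = dat, start = list(rho = 0.51),
    lower = 0, upper = 1, algorithm = "port", trace = TRUE)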
