I'm trying to simulate a simple linear model 100 times and find the LS estimate of B1 from each simulated fit.
set.seed(123498)
x<-rnorm(z, 0, 1)
e<-rnorm(z, 0 ,2)
y<-0.5 + 2*x + e
model<- lm(y~x)
simulaten=100
z=10
for (i in 1:simulaten){
e<-rnorm(n, 0 ,2)
x<-rnorm(n, 0, 1)
y<-0.5 + 2*x + e
model<- lm(y~x)}
summary(model)
Is that what my for loop is achieving, or have I missed the mark?
Here is a solution using replicate. I have set n (forgotten in the question) and reduced simulaten to a smaller value.
n <- 100
simulaten <- 4
set.seed(123498)
model_list <- replicate(simulaten, {
e <- rnorm(n, 0, 2)
x <- rnorm(n, 0, 1)
y <- 0.5 + 2*x + e
lm(y ~ x)
}, simplify = FALSE)
model_list
Edit
Several statistics can be obtained from the list of models. The coefficients are extracted by applying the function coef to each model.
Done with sapply, the returned object is a matrix with two rows (the str output below comes from a run with 1000 replications).
betas <- sapply(model_list, coef)
str(betas)
# num [1:2, 1:1000] 0.671 1.875 0.374 2.019 0.758 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:2] "(Intercept)" "x"
# ..$ : NULL
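Since the original question asks for the LS estimate of B1, the slope estimates can be pulled out of the "x" row of betas and summarised; a minimal sketch, assuming the betas matrix from above:
# the slope (B1) estimates sit in the "x" row of betas
b1 <- betas["x", ]
mean(b1)  # average of the simulated LS estimates of B1
sd(b1)    # spread of the estimates across replications
hist(b1, main = "LS estimates of B1")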
As for the graph, here is an example. Note that, in order for the x axis to cover all the estimates, the argument xlim is set to range(betas) in the first call to hist.
lgd <- c(expression(beta[0]), expression(beta[1]))
hist(betas[1, ], freq = FALSE, col = "lightblue", xlim = range(betas), ylim = c(0, 2.5), xlab = "betas", main = "")
hist(betas[2, ], freq = FALSE, col = "blue", add = TRUE)
legend("top", legend = lgd, fill = c("lightblue", "blue"), horiz = TRUE)
The model is overwritten in each iteration, so summary returns the output for the last 'model' only. We could store each fit in a list instead.
# initialize an empty list of length simulaten
modellst <- vector('list', simulaten)
for(i in seq_len(simulaten)) {
e <- rnorm(n, 0 ,2)
x <- rnorm(n, 0, 1)
y <- 0.5 + 2*x + e
# assign the fitted model to the corresponding list element
modellst[[i]] <- lm(y~x)
}
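The stored list can then be post-processed in the usual way; a minimal sketch, assuming the modellst built above:
# summary of every stored model
lapply(modellst, summary)
# or just the estimated slope (B1) from each replication
sapply(modellst, function(m) coef(m)[["x"]])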
Related
I want to plot a partition of a two-dimensional covariate space constructed by recursive binary splitting. To be more precise, I would like to write a function that replicates the following graph (taken from The Elements of Statistical Learning, p. 306):
Displayed above is a two-dimensional covariate space and a partition obtained by recursively binary splitting the space with axis-aligned splits (as in the CART algorithm). What I want to implement is a function that takes the output of the rpart function and generates such a plot.
Some example code follows:
library(rpart)
## Generating data.
set.seed(1975)
n <- 5000
p <- 2
X <- matrix(sample(seq(0, 1, by = 0.01), n * p, replace = TRUE), ncol = p)
Y <- X[, 1] + 2 * X[, 2] + rnorm(n)
## Building tree.
tree <- rpart(Y ~ ., data = data.frame(Y, X), method = "anova", control = rpart.control(cp = 0, maxdepth = 2))
Navigating SO I found this function:
rpart_splits <- function(fit, digits = getOption("digits")) {
splits <- fit$splits
if (!is.null(splits)) {
ff <- fit$frame
is.leaf <- ff$var == "<leaf>"
n <- nrow(splits)
nn <- ff$ncompete + ff$nsurrogate + !is.leaf
ix <- cumsum(c(1L, nn))
ix_prim <- unlist(mapply(ix, ix + c(ff$ncompete, 0), FUN = seq, SIMPLIFY = F))
type <- rep.int("surrogate", n)
type[ix_prim[ix_prim <= n]] <- "primary"
type[ix[ix <= n]] <- "main"
left <- character(nrow(splits))
side <- splits[, 2L]
for (i in seq_along(left)) {
left[i] <- if (side[i] == -1L)
paste("<", format(signif(splits[i, 4L], digits)))
else if (side[i] == 1L)
paste(">=", format(signif(splits[i, 4L], digits)))
else {
catside <- fit$csplit[splits[i, 4L], 1:side[i]]
paste(c("L", "-", "R")[catside], collapse = "", sep = "")
}
}
cbind(data.frame(var = rownames(splits),
type = type,
node = rep(as.integer(row.names(ff)), times = nn),
ix = rep(seq_len(nrow(ff)), nn),
left = left),
as.data.frame(splits, row.names = F))
}
}
Using this function, I am able to recover all the splitting variables and points:
splits <- rpart_splits(tree)[rpart_splits(tree)$type == "main", ]
splits
# var type node ix left count ncat improve index adj
# 1 X2 main 1 1 < 0.565 5000 -1 0.18110662 0.565 0
# 3 X2 main 2 2 < 0.265 2814 -1 0.06358597 0.265 0
# 6 X1 main 3 5 < 0.645 2186 -1 0.07645851 0.645 0
The column var tells me the splitting variables for each non-terminal node, and the column left tells the associated splitting points. However, I do not know how to use this information to produce my desired plots.
Of course, if you have any alternative strategy that does not involve the use of rpart_splits, feel free to suggest it.
You could use the (unpublished) parttree package, which you can install from GitHub via:
remotes::install_github("grantmcdermott/parttree")
This allows:
library(parttree)
ggplot() +
geom_parttree(data = tree, aes(fill = path)) +
coord_cartesian(xlim = c(0, 1), ylim = c(0, 1)) +
scale_fill_brewer(palette = "Pastel1", name = "Partitions") +
theme_bw(base_size = 16) +
labs(x = "X2", y = "X1")
Incidentally, this package also contains the function parttree, which returns something very similar to your rpart_splits function:
parttree(tree)
node Y path xmin xmax ymin ymax
1 4 0.7556079 X2 < 0.565 --> X2 < 0.265 -Inf 0.265 -Inf Inf
2 5 1.3087679 X2 < 0.565 --> X2 >= 0.265 0.265 0.565 -Inf Inf
3 6 1.8681143 X2 >= 0.565 --> X1 < 0.645 0.565 Inf -Inf 0.645
4 7 2.4993361 X2 >= 0.565 --> X1 >= 0.645 0.565 Inf 0.645 Inf
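If you prefer base graphics, the rectangle coordinates returned by parttree can also be drawn directly with rect(); a minimal sketch, assuming the tree and the parttree output shown above, with the infinite bounds clipped to the observed [0, 1] range and the axes labelled as in the ggplot example:
pt <- parttree(tree)
# clip the infinite bounds to the unit square
pt$xmin <- pmax(pt$xmin, 0); pt$xmax <- pmin(pt$xmax, 1)
pt$ymin <- pmax(pt$ymin, 0); pt$ymax <- pmin(pt$ymax, 1)
# empty plotting region spanning the covariate space, then one rectangle per leaf
plot(c(0, 1), c(0, 1), type = "n", xlab = "X2", ylab = "X1")
rect(pt$xmin, pt$ymin, pt$xmax, pt$ymax, border = "grey30")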
I've been trying to estimate VAR models using Monte Carlo Simulation. I have 3 endogenous variables. I need some guidance regarding this.
First of all, I want to add an outlier as a percentage of the sample size.
Second (a second simulation of the same model), instead of the outlier I want to use a multivariate contaminated normal error distribution, e.g. 0.9 N(0, I) + 0.1 N((0,0,0)', diag(100, 100, 100)).
Could you tell me how to do these?
Thank you.
library(tsDyn)       # for VECM()
library(MonteCarlo)  # for MonteCarlo() and MakeTable()
RR <- function(n, out){
# n is number of observations
k <- 3 # Number of endogenous variables
p <- 2 # Number of lags
# add outlier
n[1]<- n[1]+out
# Generate coefficient matrices
B1 <- matrix(c(.1, .3, .4, .1, -.2, -.3, .03, .1, .1), k) # Coefficient matrix of lag 1
B2 <- matrix(c(0, .2, .1, .07, -.4, -.1, .5, 0, -.1), k) # Coefficient matrix of lag 2
M <- cbind(B1, B2) # Companion form of the coefficient matrices
# Generate series
DT <- matrix(0, k, n + 2*p) # Raw series with zeros
for (i in (p + 1):(n + 2*p)){ # Generate series with e ~ N(0,1)
DT[, i] <- B1%*%DT[, i-1] + B2%*%DT[, i-2] + rnorm(k, 0, 1)
}
DT <- ts(t(DT[, -(1:p)])) # Convert to time series format
#names <- c("V1", "V2", "V3") # Rename variables
colnames(DT) <- c("Y1", "Y2", "Y3")
#plot.ts(DT) # Plot the series
# estimate VECM
vecm1 <- VECM(DT, lag = 2, r = 2, include = "const", estim ="ML")
vecm2 <- VECM(DT, lag = 2, r = 1, include = "const", estim ="ML")
# mse
mse1 <- mean(vecm1$residuals^2)
mse2 <- mean(vecm2$residuals^2)
#param_list <- unname(param_list)
return(list("mse1" = mse1, "mse2" = mse2, "mse3" = mse3))
}
# define the parameter grids (the parameter ranges we want to run the function over)
n_grid = c(50, 80, 200, 400)
out_grid = c(0 ,5, 10)
# collect parameter grids in a list (to enter it into the Monte Carlo function)
prml = list("n" = n_grid, "out" = out_grid)
# run simulation
RRS <- MonteCarlo(func = RR, nrep = 1000, param_list = prml)
summary(RRS)
# make table:
rows = "n"
cols = "out"
MakeTable(output = RRS, rows = rows, cols = cols)
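On the second point, the contaminated normal errors can be generated directly: each innovation vector comes from N(0, I) with probability 0.9 and from N(0, diag(100, 100, 100)) with probability 0.1, and since both mixture components have diagonal covariance, independent normals suffice (variance 100 corresponds to standard deviation 10). A minimal sketch, with the helper name rcontam made up for illustration:
rcontam <- function(k = 3, eps = 0.1, sd_contam = 10) {
  # with probability eps draw from the wide component, otherwise from N(0, I)
  if (runif(1) < eps) rnorm(k, 0, sd_contam) else rnorm(k, 0, 1)
}
# e.g. inside the series-generating loop, replace rnorm(k, 0, 1) with rcontam(k)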
I am struggling to find examples online as to how lqmm models can be easily plotted. So for example, below, I would like a simple plot where I can predict multiple quantiles and overlay these predictions onto a scatterplot:
library(lqmm)
set.seed(123)
M <- 50
n <- 10
test <- data.frame(x = runif(n*M,0,1), group = rep(1:M,each=n))
test$y <- 10*test$x + rep(rnorm(M, 0, 2), each = n) + rchisq(n*M, 3)
fit.lqm <- lqm(y ~ x , tau=c(0.1,0.5,0.9),data = test)
fit.lqmm <- lqmm(fixed = y ~ x, random = ~ 1, group = group, data = test, tau = 0.5, nK = 11, type = "normal")
I can do this successfully for lqm models, but not lqmm models.
plot(y~x,data=test)
for (k in 1:3){
curve((coef.lqm(fit.lqm)[1,k])+(coef.lqm(fit.lqm)[2,k])*(x), add = TRUE)
}
I have seen the predict.lqmm function, but this returns the predicted value for each x-value in the dataset, rather than a smooth function over the x-axis limit. Thank you in advance for any help.
For the single-tau fit, coef.lqmm returns only a single coefficient vector, so you can draw one line with those values:
coef(fit.lqmm)
#(Intercept) x
# 3.443475 9.258331
plot(y~x,data=test)
curve( coef(fit.lqmm)[1] +coef(fit.lqmm)[2]*(x), add = TRUE)
To get the quantile equivalent of normal-theory confidence intervals, you need to supply a vector of tau values. This gives a 90% coverage estimate:
fit.lqmm <- lqmm(fixed = y ~ x, random = ~ 1, group = group, data = test, tau = c(0.05, 0.5, 0.95), nK = 11, type = "normal")
pred.lqmm <- predict(fit.lqmm, level = 1)
str(pred.lqmm)
num [1:500, 1:3] 2.01 7.09 3.24 8.05 8.64 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:500] "1" "2" "3" "4" ...
..$ : chr [1:3] "0.05" "0.50" "0.95"
coef(fit.lqmm)
0.05 0.50 0.95
(Intercept) 0.6203104 3.443475 8.192738
x 10.1502027 9.258331 8.620478
plot(y~x,data=test)
for (k in 1:3){
  curve(coef.lqmm(fit.lqmm)[1, k] + coef.lqmm(fit.lqmm)[2, k]*(x), add = TRUE)
}
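To turn the predictions into curves over the x axis rather than one value per observation, you can also predict at the population level and order by x; a minimal sketch, assuming the three-tau fit.lqmm from above (with several taus, level = 0 should return a matrix with one column per tau):
pred0 <- predict(fit.lqmm, level = 0)  # population-level (fixed-effects) predictions
ord <- order(test$x)
plot(y ~ x, data = test)
matlines(test$x[ord], pred0[ord, ], lty = 1, lwd = 2)
For the linear fixed-effects specification used here, these lines coincide with the curve() calls above.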
If I run a BART model for classification using bartMachine, the returned p_hat_train values correspond to failure probabilities rather than the success probabilities returned by the original implementation of BART in the BayesTree R package.
Here is an example with a simulated binary response:
library(bartMachine)
library(BayesTree)
library(logitnorm)
N = 1000
X <- rnorm(N, 0, 1)
p_true <- invlogit(1.5*X)
y <- rbinom(N, 1, p_true)
## bartMachine
fit <- bartMachine(data.frame(X), as.factor(y), num_burn_in = 200,
num_iterations_after_burn_in = 500)
p_hat <- fit$p_hat_train
## BayesTree
fit2 <- bart(X, as.factor(y), ntree = 50, ndpost = 500)
p_hat2 <- apply(pnorm(fit2$yhat.train), 2, mean)
par(mfrow = c(2,2))
plot(p_hat, p_true, main = 'p_hat_train with bartMachine')
abline(0, 1, col = 'red')
plot(1 - p_hat, p_true, main = '1 - p_hat_train with bartMachine')
abline(0, 1, col = 'red')
plot(p_hat2, p_true, main = 'pnorm(yhat.train) with BayesTree')
abline(0, 1, col = 'red')
Inspecting the iris example from ?bartMachine suggests that bartMachine is estimating the probability that an observation is classified as the first level of the y variable, which in your example happens to be 0. To get your desired result, you'll need to specify levels when you convert y to a factor, i.e.
fit <- bartMachine(data.frame(X), factor(y, levels = c("1", "0")),
num_burn_in = 200,
num_iterations_after_burn_in = 500)
We can see what's going on when we inspect the code for build_bart_machine:
if (class(y) == "factor" & length(y_levels) == 2) {
java_bart_machine = .jnew("bartMachine.bartMachineClassificationMultThread")
y_remaining = ifelse(y == y_levels[1], 1, 0)
pred_type = "classification"
}
Looking at the model matrix stored by bartMachine (using your original specification) shows how y was recoded:
head(cbind(fit$model_matrix_training_data, y))
# X y_remaining y
# 1 -0.85093975 0 1
# 2 0.20955263 1 0
# 3 0.66489564 0 1
# 4 -0.09574123 1 0
# 5 -1.22480134 1 0
# 6 -0.36176273 1 0
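As a quick check, refitting with the releveled factor should make p_hat_train line up with p_true, mirroring the '1 - p_hat_train' panel from the question:
fit_relevel <- bartMachine(data.frame(X), factor(y, levels = c("1", "0")),
                           num_burn_in = 200,
                           num_iterations_after_burn_in = 500)
plot(fit_relevel$p_hat_train, p_true, main = 'p_hat_train after releveling y')
abline(0, 1, col = 'red')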
I am trying to fit a non-linear regression model where the mean function is the bivariate normal distribution function. The parameter to be estimated is the correlation rho.
The problem: nls fails with "singular gradient" at the first iteration step. Why?
I have here a little example with simulated data.
library(mnormt)  # for pmnorm()
# given values for independent variables
x1 <- c(rep(0.1,5), rep(0.2,5), rep(0.3,5), rep(0.4,5), rep(0.5,5))
x2 <- c(rep(c(0.1,0.2,0.3,0.4,0.5),5))
## 1 generate values for dependent variable (incl. error term)
# from bivariate normal distribution with assumed correlation rho=0.5
fun <- function(b) pmnorm(x = c(qnorm(x1[b]), qnorm(x2[b])),
mean = c(0, 0),
varcov = matrix(c(1, 0.5, 0.5, 1), nrow = 2))
set.seed(123)
y <- sapply(1:25, function(b) fun(b)) + runif(25)/1000
# put it in data frame
dat <- data.frame(y=y, x1=x1, x2=x2 )
# 2 : calculate non-linear regression from the generated data
# use rho=0.51 as starting value
fun <- function(x1, x2,rho) pmnorm(x = c(qnorm(x1), qnorm(x2)),
mean = c(0, 0),
varcov = matrix(c(1, rho, rho, 1), nrow = 2))
nls(formula= y ~ fun(x1, x2, rho), data= dat, start=list(rho=0.51),
lower=0, upper=1, trace=TRUE)
This yields an error message:
Error in nls(formula = y ~ fun(x1, x2, rho), data = dat, start = list(rho = 0.51), :
  singular gradient
In addition: Warning message:
In nls(formula = y ~ fun(x1, x2, rho), data = dat, start = list(rho = 0.51), :
  upper and lower bounds ignored unless algorithm = "port"
What I don't understand is:
I have only one parameter (rho), so there is only one gradient, which would have to be 0 for the matrix of gradients to be singular. So why should the gradient be 0?
The starting value cannot be the problem, as I know the true rho = 0.5, so a starting value of 0.51 should be fine, shouldn't it?
The data cannot be completely linearly dependent, as I added an error term to y.
I would very much appreciate any help. Thanks in advance.
Perhaps "optim" does a better job than "nls":
library(mnormt)
# given values for independent variables
x1 <- c(rep(0.1,5), rep(0.2,5), rep(0.3,5), rep(0.4,5), rep(0.5,5))
x2 <- c(rep(c(0.1,0.2,0.3,0.4,0.5),5))
## 1 generate values for dependent variable (incl. error term)
# from bivariate normal distribution with assumed correlation rho=0.5
fun <- function(b) pmnorm(x = c(qnorm(x1[b]), qnorm(x2[b])),
mean = c(0, 0),
varcov = matrix(c(1, 0.5, 0.5, 1), nrow = 2))
set.seed(123)
y <- sapply(1:25, function(b) fun(b)) + runif(25)/1000
# put it in data frame
dat <- data.frame(y=y, x1=x1, x2=x2 )
# 2 : calculate non-linear regression from the generated data
# use rho=0.51 as starting value
fun <- function(x1, x2,rho) pmnorm(x = c(qnorm(x1), qnorm(x2)),
mean = c(0, 0),
varcov = matrix(c(1, rho, rho, 1), nrow = 2))
f <- function(rho) {
  sum(sapply(1:nrow(dat), function(i) {
    (fun(dat[i, 2], dat[i, 3], rho) - dat[i, 1])^2
  }))
}
optim(0.51, f, method="BFGS")
The result is not that bad:
> optim(0.51, f, method="BFGS")
$par
[1] 0.5043406
$value
[1] 3.479377e-06
$counts
function gradient
14 4
$convergence
[1] 0
$message
NULL
Maybe even a little bit better than 0.5:
> f(0.5043406)
[1] 3.479377e-06
> f(0.5)
[1] 1.103484e-05
Let's check another start value:
> optim(0.8, f, method="BFGS")
$par
[1] 0.5043407
$value
[1] 3.479377e-06
$counts
function gradient
28 6
$convergence
[1] 0
$message
NULL
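Since there is only one parameter, base R's one-dimensional optimize() over the admissible interval is another option and, unlike the BFGS call, it respects the bounds; a minimal sketch reusing the objective f defined above:
optimize(f, interval = c(0, 1))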