I am trying to use a for-loop as a repeat counter to add summary data to a test sample. I have tried to use a data.frame, a matrix, and a vector to push my data out of the for loop and populate a table. The best I have managed is filling one complete column in a vector, or completing all columns but only one row in a data frame.
#try empty vector to populate
large.sample.df <- vector(mode = "double", length = 1000)
#try matrix to populate
large.matrix <- matrix(nrow = 1000, ncol = 3)
matrix.names <- c("mean", "lwr", "upr")
colnames(large.matrix) <- matrix.names
#Try dataframe to populate
large.df <- data.frame(mean="", lwr="", upr="")
#set total length
n <- length(large.sample.df)
#use functions to calculate confidence interval
lwr.ci <- function(a) (mean(a) - 1.96 * (sd(a)/sqrt(length(a))))
upp.ci <- function(a) (mean(a) + 1.96 * (sd(a)/sqrt(length(a))))
#Start new seed count
#begin for loop for mean, lwr, upr CI
for (i in 1:n) {
  large.sample <- rgamma(n = 1000, shape = 4, rate = 2)
  large.df$mean[i] <- mean(large.sample)
  large.df$lwr[i] <- lwr.ci(large.sample)
  large.df$upr[i] <- upp.ci(large.sample)
}
Here are two ways to get what you want. First we should distinguish between the sample size and the number of samples:
n <- 1000
samples <- 10 # Keep this small for testing and then increase it
s <- 4
r <- 2
First your loop approach:
results <- data.frame(mean=NA, lwr=NA, upr=NA) # Not "" which makes the variables character strings
for (i in 1:samples) {
  x <- rgamma(n, shape = s, rate = r)
  mn <- mean(x)
  sder <- sd(x)/sqrt(n)
  lwr <- mn - 1.96 * sder
  upr <- mn + 1.96 * sder
  results[i, ] <- c(mn, lwr, upr)
}
results
# mean lwr upr
# 1 2.015193688 1.952431714 2.077955663
# 2 2.024218250 1.962404608 2.086031891
# 3 2.008401293 1.948363928 2.068438658
# 4 1.993061142 1.932020588 2.054101696
# 5 1.975824831 1.912961486 2.038688176
# 6 1.983761126 1.923583927 2.043938325
# 7 1.983166350 1.924890819 2.041441880
# 8 1.975453269 1.915336118 2.035570420
# 9 1.976118333 1.915025748 2.037210918
# 10 2.044088839 1.983435628 2.104742050
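The comment about `""` can be checked directly. The following is a minimal sketch (the names `bad` and `good` are illustrative, not from the question). It suggests why the original loop stalled after one row: a data frame initialised with empty strings has non-numeric columns, and `df$col[i] <- value` fails once `i` exceeds the current number of rows. Row-wise assignment into a data frame initialised with `NA` grows it as needed.

```r
# Initialising with "" gives character columns (factors before R 4.0)
bad <- data.frame(mean = "", lwr = "", upr = "")
bad_is_numeric <- sapply(bad, is.numeric)

# Assigning past the last row through `$` fails for a one-row data frame
dollar_assign <- tryCatch({
  bad$mean[2] <- 1
  "ok"
}, error = function(e) "error")

# Initialising with NA keeps the columns numeric, and row-wise
# assignment adds rows as needed
good <- data.frame(mean = NA_real_, lwr = NA_real_, upr = NA_real_)
good[1, ] <- c(2.0, 1.9, 2.1)
good[2, ] <- c(2.1, 2.0, 2.2)
good_is_numeric <- sapply(good, is.numeric)
```

Growing a data frame row by row is fine for a few thousand iterations; for much larger runs, preallocating a matrix is usually faster.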
Now using replicate
confint <- function(n, s, r) {
  x <- rgamma(n, shape = s, rate = r)
  mn <- mean(x)
  sder <- sd(x)/sqrt(n)
  lwr <- mn - 1.96 * sder
  upr <- mn + 1.96 * sder
  return(c(mean=mn, lwr=lwr, upr=upr))
}
confint(n, s, r) # Test the function
# mean lwr upr
# 1.974328366 1.914003710 2.034653023
results <- replicate(samples, confint(n, s, r))
results <- t(results) # transpose so that each sample is a row
results
# mean lwr upr
# [1,] 2.015193688 1.952431714 2.077955663
# [2,] 2.024218250 1.962404608 2.086031891
# [3,] 2.008401293 1.948363928 2.068438658
# [4,] 1.993061142 1.932020588 2.054101696
# [5,] 1.975824831 1.912961486 2.038688176
# [6,] 1.983761126 1.923583927 2.043938325
# [7,] 1.983166350 1.924890819 2.041441880
# [8,] 1.975453269 1.915336118 2.035570420
# [9,] 1.976118333 1.915025748 2.037210918
# [10,] 2.044088839 1.983435628 2.104742050
Both approaches give identical results when the same random seed is set before each run.
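As a quick check of that agreement, here is a minimal sketch that runs the loop and `replicate` from the same seed and compares the two result matrices. The seed value, the helper `ci_row`, and the names `res_loop` and `res_rep` are assumptions made for this illustration; the loop fills a preallocated matrix rather than a data frame.

```r
n <- 1000      # sample size
samples <- 10  # number of samples

# Summary for one sample: mean and normal-approximation 95% CI
ci_row <- function(x) {
  mn <- mean(x)
  sder <- sd(x) / sqrt(length(x))
  c(mean = mn, lwr = mn - 1.96 * sder, upr = mn + 1.96 * sder)
}

set.seed(123)  # arbitrary seed, chosen only for this illustration
res_loop <- matrix(NA_real_, nrow = samples, ncol = 3,
                   dimnames = list(NULL, c("mean", "lwr", "upr")))
for (i in seq_len(samples)) {
  res_loop[i, ] <- ci_row(rgamma(n, shape = 4, rate = 2))
}

set.seed(123)  # same seed again, so replicate() sees the same draws
res_rep <- t(replicate(samples, ci_row(rgamma(n, shape = 4, rate = 2))))

same <- isTRUE(all.equal(unname(res_loop), unname(res_rep)))
```

Without resetting the seed, the two runs would draw different samples and the rows would differ.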
Create a simulated dataset of 100 observations, where x is a random normal variable with mean 0 and standard deviation 1, and y = 0.1 + 2 * x + e, where e is also a random normal error with mean 0 and sd 1.
# simulate a data set of 100 observations
set.seed(1) # this seed reproduces the values printed below
x <- rnorm(100)
y.1 <- 0.1 + 2*x + rnorm(100)
Now extract the first 5 observations.
y1.FirstFive <- (y.1[1:5]) # extract first 5 observations from y
x.FirstFive <- (x[1:5]) # extract first 5 observations from x
y1.FirstFive # extracted 5 observations from y1
[1] -1.7732743 0.5094025 -2.4821789 3.4485904 0.1044309
x.FirstFive # extracted 5 observations from x
[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078
Assuming the mean and sd of the sample that you calculated from the first five observations would not change, what is the minimum total number of additional observations you would need to be able to conclude that the true mean of the population is different from 0 at the 0.01 significance level?
alpha <- 0.01
mu <- 0
for (i in 5:2000) {
  # Recalculate the standard error and CI
  stand_err <- Sd_y1 / sqrt(i)
  ci <- sample_mean_y1 + c(qt(alpha/2, i-1), qt(1-alpha/2, i-1))*stand_err
  if (ci[2] < mu)
    break # condition met, exit loop
}
i
[1] 2000
Here, I wrote a loop that iteratively increases n from the initial n = 5 to n = 2000, recomputes the confidence interval with qt (given a fixed y-bar and sd), and stops when the interval excludes 0, i.e. when p < 0.01. However, I keep getting the wrong output: the result is always the upper end of the range I give (here, 2000) rather than the minimum sample size n needed to reject the null hypothesis that mu_y = 0 at the 0.01 level. Any suggestions as to how to fix the code?
Additional info: the sd of y1.FirstFive is 2.3 and the mean of y1.FirstFive is -0.04.
Sd_y1 = sd(y1.FirstFive)
sample_mean_y1 = mean(y1.FirstFive)
sample_mean_y1
[1] -0.03860587
As pointed out by @jblood94, you need a larger sample size.
You don't need a for loop for this, since most of the functions involved are vectorized. Something like this works:
n = 5:30000
stand_err = Sd_y1 / sqrt(n)
ub = sample_mean_y1 + qt(1-alpha/2, n-1)*stand_err
min(n[ub < 0]) # smallest n whose upper CI bound falls below 0
[1] 23889
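For readers who want to check the vectorized search without regenerating the data, here is a self-contained sketch that starts from the five y values printed in the question. The names `y5`, `rejects`, `n_min`, and `additional` are introduced here for illustration. Because the inputs are the rounded printed values, the result should land close to, though not necessarily exactly on, the figure reported above.

```r
# First five y values as printed in the question
y5 <- c(-1.7732743, 0.5094025, -2.4821789, 3.4485904, 0.1044309)
m <- mean(y5)
s <- sd(y5)
alpha <- 0.01

# TRUE when the 99% CI around m excludes 0 for sample size n
rejects <- function(n) abs(m) > qt(1 - alpha / 2, n - 1) * s / sqrt(n)

n <- 5:30000
n_min <- min(n[rejects(n)])  # smallest total sample size that rejects
additional <- n_min - 5      # observations needed beyond the first five
```

Using `abs(m)` makes the check independent of the sign of the sample mean, and `additional` answers the question as posed, which asks for observations on top of the five already collected.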
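The loop never reaches the break because the required n is greater than 2000. You can bracket the answer by doubling an upper bound and then solve for it with uniroot:
set.seed(1) # this seed reproduces the values printed in the question
x <- rnorm(100)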
y.1 <- 0.1 + 2*x + rnorm(100)
Sd_y1 <- sd(y.1[1:5])
sample_mean_y1 <- mean(y.1[1:5])
alpha <- 0.01
sgn <- 2*(sample_mean_y1 > 0) - 1
f <- function(n) qt(alpha/2, n - 1)*Sd_y1 + sgn*sample_mean_y1*sqrt(n)
upper <- 2
while (f(upper) < 0) upper <- upper*2
(n <- ceiling(uniroot(f, lower = upper/2, upper = upper, tol = 0.5)$root))
#> [1] 23889
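The function `f` in this answer can be read as the rejection condition rearranged so that its sign change marks the required n. A short derivation, writing \bar{y} for `sample_mean_y1`, s for `Sd_y1`, and t_{p,\nu} for the p-quantile of the t distribution with \nu degrees of freedom (this notation is introduced here, not taken from the answer):

```latex
\[
|\bar{y}| > t_{1-\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}
\;\iff\;
|\bar{y}|\,\sqrt{n} - t_{1-\alpha/2,\,n-1}\,s > 0
\;\iff\;
f(n) = t_{\alpha/2,\,n-1}\,s + \operatorname{sgn}(\bar{y})\,\bar{y}\,\sqrt{n} > 0 ,
\]
\[
\text{using } t_{\alpha/2,\,n-1} = -\,t_{1-\alpha/2,\,n-1}
\text{ and } \operatorname{sgn}(\bar{y})\,\bar{y} = |\bar{y}| .
\]
```

The `while` loop doubles `upper` until f(upper) is no longer negative, so the root lies between `upper/2` and `upper`; `uniroot` then locates it to within `tol = 0.5`, and `ceiling` rounds up to a whole sample size.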
I tried to implement the AdaBoost algorithm of Freund and Schapire as close to the original as possible (see p. 2 here: http://rob.schapire.net/papers/explaining-adaboost.pdf):
library(rpart)
library(OneR) # provides eval_model()

maxdepth <- 1
T <- 100 # number of rounds
# Given: (x_1, y_1),...,(x_m, y_m) where x_i element of X, y_i element of {-1, +1}
myocarde <- read.table("http://freakonometrics.free.fr/myocarde.csv", head = TRUE, sep = ";")
#myocarde <- read.table("data/myocarde.csv", header = TRUE, sep = ";")
y <- (myocarde[ , "PRONO"] == "SURVIE") * 2 - 1
x <- myocarde[ , 1:7]
m <- nrow(x)
data <- data.frame(x, y)
# Initialize: D_1(i) = 1/m for i = 1,...,m
D <- rep(1/m, m)
H <- replicate(T, list())
a <- vector(mode = "numeric", T)
# For t = 1,...,T
for(t in 1:T) {
# Train weak learner using distribution D_t
# Get weak hypothesis h_t: X -> {-1, +1}
data_D_t <- data[sample(m, 10*m, replace = TRUE, prob = D), ]
H[[t]] <- rpart(y ~., data = data_D_t, maxdepth = maxdepth, method = "class")
# Aim: select h_t with low weighted error: e_t = Pr_i~D_t[h_t(x_i) != y_i]
h <- predict(H[[t]], x, type = "class")
e <- sum(h != y) / m
# Choose a_t = 0.5 * log((1-e) / e)
a[t] <- 0.5 * log((1-e) / e)
# Update for i = 1,...,m: D_t+1(i) = (D_t(i) * exp(-a_t * y_i * h_t(x_i))) / Z_t
# where Z_t is a normalization factor (chosen so that Dt+1 will be a distribution)
D <- D * exp(-a[t] * y * as.numeric(h))
D <- D / sum(D)
}
# Output the final hypothesis: H(x) = sign(sum of a_t * h_t(x) for t=1 to T)
newdata <- x
H_x <- sapply(H, function(x) as.numeric(as.character(predict(x, newdata = newdata, type = "class"))))
H_x <- t(a * t(H_x))
pred <- sign(rowSums(H_x))
eval_model(pred, y)
## Confusion matrix (absolute):
##           Actual
## Prediction -1  1 Sum
##        -1   0  1   1
##        1   29 41  70
##        Sum 29 42  71
##
## Confusion matrix (relative):
##           Actual
## Prediction   -1    1  Sum
##        -1  0.00 0.01 0.01
##        1   0.41 0.58 0.99
##        Sum 0.41 0.59 1.00
##
## Accuracy:
## 0.5775 (41/71)
##
## Error rate:
## 0.4225 (30/71)
##
## Error rate reduction (vs. base rate):
## -0.0345 (p-value = 0.6436)
As can be seen the accuracy of the model is horrible compared to other AdaBoost implementations, e.g.:
library(JOUSBoost)
## JOUSBoost 2.1.0
boost <- adaboost(as.matrix(x), y, tree_depth = maxdepth, n_rounds = T)
pred <- predict(boost, x)
eval_model(pred, y)
## Confusion matrix (absolute):
##           Actual
## Prediction -1  1 Sum
##        -1  29  0  29
##        1    0 42  42
##        Sum 29 42  71
##
## Confusion matrix (relative):
##           Actual
## Prediction   -1    1  Sum
##        -1  0.41 0.00 0.41
##        1   0.00 0.59 0.59
##        Sum 0.41 0.59 1.00
##
## Accuracy:
## 1 (71/71)
##
## Error rate:
## 0 (0/71)
##
## Error rate reduction (vs. base rate):
## 1 (p-value < 2.2e-16)
My question
Could you please give me a hint as to what went wrong in my implementation? Thank you.
The final and corrected code can be found in my blog post: Understanding AdaBoost – or how to turn Weakness into Strength
There are quite a few contributing factors as to why your implementation is not working.
You were not using rpart correctly. The AdaBoost description does not mention upsampling with the weights, but rpart itself can accept weights. My example below shows how rpart should be used for this purpose.
Calculation of the weighted error was wrong. You were calculating the error proportion (the number of samples predicted incorrectly divided by the number of samples). AdaBoost uses the sum of the weights of the incorrectly predicted samples (sum(D[y != yhat])).
The final predictions seemed to be incorrect too; I just ended up doing a simple loop.
Next time, I recommend diving into the source code of the other implementations you are comparing against.
https://github.com/cran/JOUSBoost/blob/master/R/adaboost.R uses almost identical code to my below example - and probably would have helped guide you originally.
Additionally, using T as a variable name could interfere with the logical TRUE and its shorthand T, so I'd avoid it.
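To make the weighted-error point concrete, here is a small sketch with made-up labels, predictions, and weights (none of these numbers come from the myocarde data). It contrasts the plain error proportion with the weighted error, then applies one AdaBoost weight update. A useful sanity check, which follows from the update formula, is that the weak learner's weighted error under the updated weights comes out at 0.5.

```r
y    <- c( 1,  1, -1, -1,  1)       # true labels (toy data)
yhat <- c( 1, -1, -1,  1,  1)       # weak learner predictions (toy data)
D    <- c(0.1, 0.2, 0.3, 0.1, 0.3)  # current weights, summing to 1

err_prop     <- mean(yhat != y)     # unweighted proportion: 2 of 5
err_weighted <- sum(D[yhat != y])   # weighted error: 0.2 + 0.1

# One AdaBoost round: coefficient, weight update, renormalisation
a <- 0.5 * log((1 - err_weighted) / err_weighted)
D_new <- D * exp(-a * y * yhat)
D_new <- D_new / sum(D_new)

# Weighted error of the same learner under the updated weights
err_after <- sum(D_new[yhat != y])
```

The two error measures only coincide when all weights are equal, which is true in the first round and generally not afterwards.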
### packages ###
library(rpart)
library(OneR) # provides eval_model()
### parameters ###
maxdepth <- 1
rounds <- 100
### data ###
myocarde <- read.table("http://freakonometrics.free.fr/myocarde.csv", head = TRUE, sep = ";")
y <- (myocarde[ , "PRONO"] == "SURVIE") * 2 - 1
x <- myocarde[ , 1:7]
m <- nrow(x)
dataset <- data.frame(x, y)
### initialisation ###
D <- rep(1/m, m)
H <- list()
a <- vector(mode = "numeric", length = rounds)
for (i in seq.int(rounds)) {
# train weak learner
H[[i]] = rpart(y ~ ., data = dataset, weights = D, maxdepth = maxdepth, method = "class")
# predictions
yhat <- predict(H[[i]], x, type = "class")
yhat <- as.numeric(as.character(yhat))
# weighted error
e <- sum(D[yhat != y])
# alpha coefficient
a[i] <- 0.5 * log((1 - e) / e)
# updating weights (D)
D <- D * exp(-a[i] * y * yhat)
D <- D / sum(D)
}
# predict with each weak learner on dataset
y_hat_final <- vector(mode = "numeric", length = m)
for (i in seq(rounds)) {
pred = predict(H[[i]], dataset, type = "class")
pred = as.numeric(as.character(pred))
y_hat_final = y_hat_final + (a[i] * pred)
}
pred <- sign(y_hat_final)
eval_model(pred, y)
## Confusion matrix (absolute):
##           Actual
## Prediction -1  1 Sum
##        -1  29  0  29
##        1    0 42  42
##        Sum 29 42  71
##
## Confusion matrix (relative):
##           Actual
## Prediction   -1    1  Sum
##        -1  0.41 0.00 0.41
##        1   0.00 0.59 0.59
##        Sum 0.41 0.59 1.00
##
## Accuracy:
## 1 (71/71)
##
## Error rate:
## 0 (0/71)
##
## Error rate reduction (vs. base rate):
## 1 (p-value < 2.2e-16)
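One detail in the corrected loop that is easy to miss is the `as.numeric(as.character(...))` conversion. `predict(..., type = "class")` returns a factor, and calling `as.numeric` on a factor returns the internal level codes rather than the labels. The toy factor below (not model output) illustrates the difference; the question's weight update used `as.numeric(h)` directly, which would feed 1 and 2 into the formula instead of -1 and +1.

```r
# A factor with the same levels a class prediction would have here
h <- factor(c("-1", "1", "1", "-1"), levels = c("-1", "1"))

codes  <- as.numeric(h)                # level codes: 1 2 2 1
labels <- as.numeric(as.character(h))  # actual labels: -1 1 1 -1
```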
I'm looking to simulate an age variable (constrained range 18-35) that is correlated 0.1 with an existing binary variable called use. Most of the examples I've come across demonstrate how to simulate both variables simultaneously.
# setup
n <- 134
dat <- data.frame(partID=seq(1, n, 1),
trt=c(rep(0, n/2),
rep(1, n/2)))
# set proportion
a <- .8
b <- .2
dat$use <- c(rbinom(n/2, 1, b),
rbinom(n/2, 1, a))
Not sure if this is the best way to approach this, but you might get close using the answer from here: https://stats.stackexchange.com/questions/15011/generate-a-random-variable-with-a-defined-correlation-to-an-existing-variable
For example (using the code from the link):
x1 <- dat$use # fixed given data
rho <- 0.1 # desired correlation = cos(angle)
theta <- acos(rho) # corresponding angle
x2 <- rnorm(n, 2, 0.5) # new random data
X <- cbind(x1, x2) # matrix
Xctr <- scale(X, center=TRUE, scale=FALSE) # centered columns (mean 0)
Id <- diag(n) # identity matrix
Q <- qr.Q(qr(Xctr[ , 1, drop=FALSE])) # QR-decomposition, just matrix Q
P <- tcrossprod(Q) # = Q Q' # projection onto space defined by x1
x2o <- (Id-P) %*% Xctr[ , 2] # x2ctr made orthogonal to x1ctr
Xc2 <- cbind(Xctr[ , 1], x2o) # bind to matrix
Y <- Xc2 %*% diag(1/sqrt(colSums(Xc2^2))) # scale columns to length 1
x <- Y[ , 2] + (1 / tan(theta)) * Y[ , 1] # final new vector
dat$age <- (1 + x) * 25
cor(dat$use, dat$age)
# 0.1
summary(dat$age)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 20.17 23.53 25.00 25.00 26.59 30.50
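If the ages should span the full 18 to 35 range, one option is to rescale linearly, since a positive linear transformation leaves the Pearson correlation unchanged. The sketch below is an alternative construction, not the linked answer's code: it builds the new variable from regression residuals, which are uncorrelated with `use` in the sample, and then rescales. The seed, the stand-in `use` vector, and the names `z`, `e`, `x`, and `age` are assumptions for illustration.

```r
set.seed(1)                   # arbitrary seed for the illustration
n   <- 134
use <- rbinom(n, 1, 0.5)      # stand-in for dat$use
rho <- 0.1                    # target sample correlation

z <- rnorm(n)
e <- residuals(lm(z ~ use))   # part of z uncorrelated with use in-sample

# Combine standardised pieces; the sample correlation with use equals rho
x <- rho * as.numeric(scale(use)) + sqrt(1 - rho^2) * as.numeric(scale(e))

# Positive linear rescaling to [18, 35] leaves the correlation unchanged
age <- 18 + (x - min(x)) / (max(x) - min(x)) * (35 - 18)
```

Rounding `age` to whole years afterwards would shift the correlation slightly away from 0.1, so it is worth rechecking `cor(use, age)` if integer ages are needed.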
I want to generate two sets of numbers that are correlated at .9, .5, and .0.
A is drawn from rnorm(30, -0.5, 1)
B is drawn from rnorm(30, .5, 2)
and I want to make A and B correlated at .9, .5, and .0.
You are describing a multivariate normal distribution, which can be sampled with the mvrnorm function from the MASS package:
library(MASS)
meanA <- -0.5
meanB <- 0.5
sdA <- 1
sdB <- 2
correlation <- 0.9
vals <- mvrnorm(10000, c(meanA, meanB), matrix(c(sdA^2, correlation*sdA*sdB,
correlation*sdA*sdB, sdB^2), nrow=2))
mean(vals[,1])
# [1] -0.4883265
mean(vals[,2])
# [1] 0.5201586
sd(vals[,1])
# [1] 0.9994628
sd(vals[,2])
# [1] 1.992816
cor(vals[,1], vals[,2])
# [1] 0.8999285
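With only 30 draws, as in the question, the sample correlation can wander noticeably from the target. If the sample itself needs to hit the target, `mvrnorm` has an `empirical` argument that makes `mu` and `Sigma` hold for the generated sample rather than for the population. A minimal sketch, with an arbitrary seed and the illustrative names `Sigma`, `vals30`, and `r30`:

```r
library(MASS)

set.seed(1)  # arbitrary seed for the illustration
Sigma <- matrix(c(1^2,         0.9 * 1 * 2,
                  0.9 * 1 * 2, 2^2), nrow = 2)

# empirical = TRUE forces the sample mean and covariance to match
vals30 <- mvrnorm(30, mu = c(-0.5, 0.5), Sigma = Sigma, empirical = TRUE)
r30 <- cor(vals30[, 1], vals30[, 2])
```

The trade-off is that the draws are no longer an ordinary random sample from the distribution, which matters if the simulation is meant to reflect sampling variability.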
As an alternative, please consider the following. Let the random variables X ~ N(0,1) and Y ~ N(0,1) be independent. Then the random variables X and rho X + sqrt(1 - rho^2) Y are both distributed N(0,1), but are now correlated with correlation rho. So possible R code could be
# Define the parameters
meanA <- -0.5
meanB <- 0.5
sdA <- 1
sdB <- 2
correlation <- 0.9
n <- 10000 # You want 30
# Generate from independent standard normals
x <- rnorm(n, 0, 1)
y <- rnorm(n, 0, 1)
# Transform
x2 <- x # could be avoided
y2 <- correlation*x + sqrt(1 - correlation^2)*y
# Fix up means and standard deviations
x3 <- meanA + sdA*x2
y3 <- meanB + sdB*y2
# Check summary statistics
mean(x3)
# [1] -0.4981958
mean(y3)
# [1] 0.4999068
sd(x3)
# [1] 1.014299
sd(y3)
# [1] 2.022377
cor(x3, y3)
# [1] 0.9002529
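The claim behind this construction can be verified in two lines. Writing W for the transformed variable (a symbol introduced here for convenience), and using the independence of X and Y:

```latex
\[
W = \rho X + \sqrt{1-\rho^{2}}\,Y, \qquad
\operatorname{Var}(W) = \rho^{2}\operatorname{Var}(X) + (1-\rho^{2})\operatorname{Var}(Y)
= \rho^{2} + (1-\rho^{2}) = 1 ,
\]
\[
\operatorname{Cov}(X, W) = \rho\,\operatorname{Var}(X) + \sqrt{1-\rho^{2}}\,\operatorname{Cov}(X, Y) = \rho ,
\qquad
\operatorname{Corr}(X, W) = \frac{\rho}{\sqrt{1 \cdot 1}} = \rho .
\]
```

Shifting and scaling to the target means and standard deviations afterwards does not change the correlation, because it is invariant under positive linear transformations.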
I created the correlate package to be able to create a correlation between variables of any type (regardless of distribution), within a given tolerance. It does so by permutations.
library(correlate)
A <- rnorm(30, -0.5, 1)
B <- rnorm(30, .5, 2)
C <- correlate(cbind(A,B), 0.9)
# 0.9012749
D <- correlate(cbind(A,B), 0.5)
# 0.5018054
E <- correlate(cbind(A,B), 0.0)
# -0.00407327
You can pretty much decide the whole matrix if you want (for multiple variables), by giving a matrix as the second argument.
Ironically, you can also use it to create a multivariate normal.
I have written the code below to obtain a bootstrap estimate of a mean. My objective is to view the numbers selected from the data set, ideally in the order they are selected, by the function boot in the boot package.
The data set only contains three numbers: 1, 10, and 100 and I am only using two bootstrap samples.
The estimated mean is 23.5 and the R code below indicates that the six numbers included one '1', four '10' and one '100'. However, there are 30 possible combinations of those numbers that would have resulted in a mean of 23.5.
Is there a way for me to determine which of those 30 possible combinations is the combination that actually appeared in the two bootstrap samples?
library(boot)
dat <- c(1, 10, 100)
av <- function(dat, i) { sum(dat[i])/length(dat[i]) }
av.boot <- boot(dat, av, R = 2)
av.boot
# Call:
# boot(data = dat, statistic = av, R = 2)
# Bootstrap Statistics :
# original bias std. error
# t1* 37 -13.5 19.09188
mean(dat) + -13.5
# [1] 23.5
# The two samples must have contained one '1', four '10' and one '100',
# but there are 30 possibilities.
# Which of these 30 possible sequences actually occurred?
# This code shows there must have been one '1', four '10' and one '100'
# and shows the 30 possible combinations
my.combos <- expand.grid(V1 = c(1, 10, 100),
V2 = c(1, 10, 100),
V3 = c(1, 10, 100),
V4 = c(1, 10, 100),
V5 = c(1, 10, 100),
V6 = c(1, 10, 100))
my.means <- apply(my.combos, 1, function(x) {( (x[1] + x[2] + x[3])/3 + (x[4] + x[5] + x[6])/3 ) / 2 })
possible.samples <- my.combos[my.means == 23.5,]
n.1 <- rowSums(possible.samples == 1)
n.10 <- rowSums(possible.samples == 10)
n.100 <- rowSums(possible.samples == 100)
length(unique(n.1)) == 1
length(unique(n.10)) == 1
length(unique(n.100)) == 1
I think you can determine the numbers sampled and the order in which they are sampled with the code below. You have to extract the function ordinary.array from the boot package and paste that function into your R code. Then specify the values for n, R and strata, where n is the number of observations in the data set and R is the number of replicate samples you want.
I do not know how general this approach is, but it worked with a couple of simple examples I tried, including the example below.
library(boot)
dat <- c(1, 10, 100, 1000)
av <- function(dat, i) { sum(dat[i])/length(dat[i]) }
av.boot <- boot(dat, av, R = 3)
av.boot
# Call:
# boot(data = dat, statistic = av, R = 3)
# Bootstrap Statistics :
# original bias std. error
# t1* 277.75 -127.5 132.2405
mean(dat) + -127.5
# [1] 150.25
# boot:::ordinary.array
ordinary.array <- function (n, R, strata)
{
    inds <- as.integer(names(table(strata)))
    if (length(inds) == 1L) {
        output <- sample.int(n, n * R, replace = TRUE)
        dim(output) <- c(R, n)
    }
    else {
        output <- matrix(as.integer(0L), R, n)
        for (is in inds) {
            gp <- seq_len(n)[strata == is]
            output[, gp] <- if (length(gp) == 1)
                rep(gp, R)
            else bsample(gp, R * length(gp)) # bsample is internal: boot:::bsample
        }
    }
    output
}
# I think the function ordinary.array determines which elements
# of the data are sampled in each of the R samples.
# Reset the RNG to the seed used before boot() first, so the same indices are drawn.
ordinary.array(n = length(dat), R = 3, strata = rep(1, length(dat)))
# [,1] [,2] [,3] [,4]
# [1,] 1 3 1 3
# [2,] 3 4 1 3
# [3,] 3 3 3 3
# which equals:
((1+100+1+100) / 4 + (100+1000+1+100) / 4 + (100+100+100+100) / 4) / 3
# [1] 150.25
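An alternative that avoids copying internal code is `boot.array()` from the boot package, which regenerates the resampling indices from the seed stored in the boot object. The sketch below is a hedged illustration, not part of the answer above; the seed and the names `idx` and `sampled` are assumptions, so the indices will differ from the matrix shown there.

```r
library(boot)

set.seed(1)  # arbitrary seed for the illustration
dat <- c(1, 10, 100, 1000)
av <- function(dat, i) mean(dat[i])
av.boot <- boot(dat, av, R = 3)

# One row per bootstrap replicate; entries are indices into dat
idx <- boot.array(av.boot, indices = TRUE)

# The sampled values themselves, in the same layout
sampled <- matrix(dat[idx], nrow = nrow(idx))
```

Taking `rowMeans(sampled)` should reproduce `av.boot$t`, which is a convenient way to confirm that the recovered indices are the ones `boot` actually used.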