I want to pass the name of the weighting variable to the "weights" argument of the lmer function, so that I can change the dataset, loop over the candidate weights, and pull the correct weights column inside the for loop each time.
So for y, the equation would be:
lmer(y~x+(1|study), weights = weight.var)
And y1:
lmer(y1~x+(1|study),weights = weight.var1)
So I stored the names of the weighting variables in weight.opt and want to use them in the model call within the for loop. I can use "as.formula" to build the formula and connect it to the dataset, but I'm not sure how to do something similar with the weights.
x <- rnorm(300,0,1)
y <- x*rnorm(300,2,0.5)
y1 <- x*rnorm(300,0.1,0.1)
study <- rep(c("a","b","c"),each = 100)
weight.var <- rep(c(0.5,2,4),each = 100)
weight.var1 <- rep(c(0.1,.2,.15),each = 100)
library(lme4)
dataset <- data.frame(x,y,y1,study,weight.var,weight.var1)
resp1 <- c("y","y1")
weight.opt <- c("weight.var","weight.var1")
for(i in 1:2){
lmer(as.formula(paste(resp1[i],"~x+(1|study)")),weights = weight.opt[i],data = dataset)
}
This seems to work fine:
res_list <- list()
for(i in 1:2){
res_list[[i]] <- lmer(as.formula(paste(resp1[i],"~x+(1|study)")),
weights = dataset[[weight.opt[i]]],data = dataset)
}
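The difference between the two loops can be sketched in base R. This is only an illustration under stated assumptions: lm() stands in for lmer() so that the snippet runs without lme4, and the small data frame is made up for the example.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Made-up data for illustration (not the dataset from the question)
dataset <- data.frame(
  x = rnorm(6),
  y = rnorm(6),
  weight.var = rep(c(0.5, 2, 4), each = 2)
)
weight.opt <- c("weight.var")

# weight.opt[1] is only the column name, i.e. a character string
is.character(weight.opt[1])

# [[ ]] looks the name up and returns the numeric column itself
w <- dataset[[weight.opt[1]]]

# lm() is a stand-in for lmer() here: it also expects a numeric weights vector
fit <- lm(y ~ x, data = dataset, weights = dataset[[weight.opt[1]]])
```

Passing the string itself hands the model function a name rather than numbers, which is why the first loop fails while the `[[ ]]` version works.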
I'm running the bagging tree (bootstrap aggregation) classification method and comparing its misclassification error rate with that of a single tree.
This seems strange to me: the function estim.pred returns a factor with the levels "neg" and "pos", but res.boot$t is a matrix of integers taking the values 1 or 2, even though res.boot$t is supposed to hold the values returned by the statistic estim.pred.
Could you please explain the reason for this behaviour?
library(rpart)
library(boot)
library(mlbench)
data(PimaIndiansDiabetes)
n <- 768
ntrain <- 468
ntest <- 300
B <- 100
M <- 100
train.error <- vector(length = M)
test.error <- vector(length = M)
bagging.error <- vector(length = M)
# Assumed train/test split: train.set and test.set are used below but are not defined in the original post
train.idx <- sample(n, ntrain)
train.set <- PimaIndiansDiabetes[train.idx, ]
test.set <- PimaIndiansDiabetes[-train.idx, ]
estim.pred <- function(a.sample, vector.of.indices)
{
current.train <- a.sample[vector.of.indices, ]
current.fitted.model <- rpart(diabetes ~ ., data = current.train, method = "class")
predict(current.fitted.model, test.set, type = "class")
}
fitted.tree <- rpart(diabetes ~ ., data = train.set, method = "class")
pred.train <- predict(fitted.tree, train.set, type = "class")
res.boot = boot(train.set, estim.pred, B)
head(pred.train)
head(res.boot$t)
Here is @Roland's comment. I am posting it as an answer to remove my question from the unanswered list.
res.boot$t is a matrix. A matrix cannot contain a factor variable. Thus, the matrix contains the underlying integer values. Transpose the matrix, turn it into a data.frame and turn the integers into factor variables with your levels.
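The conversion described above can be sketched as follows. The small integer matrix is a hypothetical stand-in for res.boot$t (one row per bootstrap replicate), and the levels are assumed to be "neg" and "pos" in that order, as in the diabetes column of PimaIndiansDiabetes.

```r
# Assumed factor levels, in the order of their integer codes
lev <- c("neg", "pos")

# Hypothetical stand-in for res.boot$t: 2 replicates x 3 test observations
t.mat <- matrix(c(1L, 2L, 2L,
                  1L, 1L, 2L), nrow = 2, byrow = TRUE)

# Transpose so that each column holds one bootstrap replicate
pred.df <- as.data.frame(t(t.mat))

# Map the integer codes back to factor labels, column by column
pred.df[] <- lapply(pred.df, function(col) {
  factor(col, levels = seq_along(lev), labels = lev)
})

str(pred.df)
```

With the real object, `t.mat` would be replaced by `res.boot$t`.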
Here is my code:
data <-data.frame(matrix(0,nrow = 9,ncol = 2))
data[,1] <- c(0,15,41,81,146,211,438,958,1733)
data[,2] <-c(0.000000,5.7013061,13.2662515,26.0874534,42.2710547,55.6050052,75.597474,112.6755999,109.45890071)
rownames(data) <- c("E0_TAP","E3_TAP","E4_TAP","E5_TAP","E6_TAP","E7_TAP","E8_TAP","E10_TAP","E12_TAP")
colnames(data) <- c('S','v')
This is the light saturation curve of photosystem II in Chlamydomonas reinhardtii. I would like to find the best fit for my curve using the Michaelis-Menten model. I tried the drm() command in this way:
model.drm <- drm (v ~ cluster(S), data = data, fct = MM.2())
When I run this code the calculation of the fitting starts, but it's interrupted by an error that I do not really comprehend:
Error in parse(text = paste(paste(rep("c(", nrep - 1), collapse = ""), :
<text>:2:39: unexpected ')'
1: mu[(1+( 1 * (i - 1))),] %*%
2: mu[( 2 + ( 1 * (i - 1))),drop=FALSE,])
^
In addition: Warning message:
In cbind(mu[, 2:(nclass - 1)], 1) - mu[, seq(nclass - 1)] :
longer object length is not a multiple of shorter object length
Timing stopped at: 0 0 0
Although I will keep trying to solve the problem myself, I would really appreciate it if someone could help me fix it more quickly or suggest an alternative way to perform the analysis.
Thanks in advance!
Thanks to the help of a friend, here is the answer:
data <-data.frame(matrix(0,nrow = 9,ncol = 2))
data[,1] <- c(0,15,41,81,146,211,438,958,1733)
data[,2] <-c(0.000000,5.7013061,13.2662515,26.0874534,42.2710547,55.6050052,75.597474,112.6755999,109.45890071)
rownames(data) <- c("E0_TAP","E3_TAP","E4_TAP","E5_TAP","E6_TAP","E7_TAP","E8_TAP","E10_TAP","E12_TAP")
colnames(data) <- c('S','v')
data <- t(data) #transpose
data1 <- cbind(data,data) #duplicate
data1 <- cbind(data1,data1) # quadruplicate
data <- as.data.frame(t(data1)) #transpose
model.drm <- drm (v ~ cluster(S), data = data, fct = MM.2()) #fitting analysis
S <- data[,1]
v <- data[,2]
mml <- data.frame(S = seq(0, max(S)+9000, length.out = 200))
mml$v <- predict(model.drm, newdata = mml)
s <- mml[,1]
v <- mml[,2]
plot(s,v)
lines(s,v,lty=2,col="red",lwd=3)
coeff <- as.data.frame(coef(summary(model.drm)))
The issue comes from the dataset itself. To bypass the error, I had to replicate my data several times (here, four copies of each observation). I assume it would be even better to have real replicates of the experiment instead of cloning the same measurements.
Please leave a comment!
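As one possible alternative way to perform the analysis, here is a minimal sketch that fits the same Michaelis-Menten curve with base R's nls() on the original nine points, without duplicating the data. The starting values for Vmax and K are rough guesses read off the data, not values from the post.

```r
# Original measurements from the question
S <- c(0, 15, 41, 81, 146, 211, 438, 958, 1733)
v <- c(0, 5.7013061, 13.2662515, 26.0874534, 42.2710547, 55.6050052,
       75.597474, 112.6755999, 109.45890071)
dat <- data.frame(S = S, v = v)

# Michaelis-Menten model: v = Vmax * S / (K + S)
# Starting values are assumptions (rough guesses from the plateau and half-saturation point)
fit <- nls(v ~ Vmax * S / (K + S), data = dat,
           start = list(Vmax = 120, K = 200))
coef(fit)

# Predicted curve on a fine grid, e.g. for plotting with lines()
grid <- data.frame(S = seq(0, max(S), length.out = 200))
grid$v <- predict(fit, newdata = grid)
```

The fitted `grid` can be drawn over the raw points with `plot(dat$S, dat$v)` followed by `lines(grid$S, grid$v)`.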
While using regsubsets from the leaps package on data with linear dependencies, I found that the results given by coef() and by summary()$which differ. It seems that, when linear dependencies are found, the reordering changes the position of the coefficients and coef() returns the wrong values.
I use mtcars just to "simulate" the problem I had with other data. In the first example there are no linear dependencies, the best model by BIC is mpg~wt+cyl, and coef() and summary()$which give the same result. In the second example I add a dummy variable, so perfect multicollinearity is possible, but with the variables in this order (dummy in the last column) the problem does not occur. In the last example, after changing the order of the variables in the dataset, the problem finally appears and coef() and summary()$which give different models. Is there anything incorrect in this approach? Is there any other way to get the coefficients from regsubsets?
require("leaps") #install.packages("leaps")
###Example1
dta <- mtcars[,c("mpg","cyl","am","wt","hp") ]
bestSubset.cars <- regsubsets(mpg~., data=dta)
(best.sum <- summary(bestSubset.cars))
#
w <- which.min(best.sum$bic)
best.sum$which[w,]
#
best.sum$outmat
coef(bestSubset.cars, w)
#
###Example2
dta2 <- cbind(dta, manual=as.numeric(!dta$am))
bestSubset.cars2 <- regsubsets(mpg~., data=dta2)
(best.sum2 <- summary(bestSubset.cars2))
#
w <- which.min(best.sum2$bic)
best.sum2$which[w,]
#
coef(bestSubset.cars2, w)
#
###Example3
bestSubset.cars3 <- regsubsets(mpg~., data=dta2[,c("mpg","manual","am","cyl","wt","hp")])
(best.sum3 <- summary(bestSubset.cars3))
#
w <- which.min(best.sum3$bic)
best.sum3$which[w,]
#
coef(bestSubset.cars3, w)
#
best.sum2$which
coef(bestSubset.cars2,1:4)
best.sum3$which
coef(bestSubset.cars3,1:4)
The variable order used by summary.regsubsets differs from the order stored by regsubsets. The coef() method for regsubsets objects combines the two in one function, so the results get mixed up if you use force.in or a formula with a fixed variable order. Changing a few lines in the coef() method might help. Try the code below and see if it works!
coef.regsubsets <- function (object, id, vcov = FALSE, ...)
{
s <- summary(object)
invars <- s$which[id, , drop = FALSE]
betas <- vector("list", length(id))
for (i in 1:length(id)) {
# added
var.name <- names(which(invars[i, ]))
thismodel <- which(object$xnames %in% var.name)
names(thismodel) <- var.name
# deleted
#thismodel <- which(invars[i, ])
qr <- .Fortran("REORDR", np = as.integer(object$np),
nrbar = as.integer(object$nrbar), vorder = as.integer(object$vorder),
d = as.double(object$d), rbar = as.double(object$rbar),
thetab = as.double(object$thetab), rss = as.double(object$rss),
tol = as.double(object$tol), list = as.integer(thismodel),
n = as.integer(length(thismodel)), pos1 = 1L, ier = integer(1))
beta <- .Fortran("REGCF", np = as.integer(qr$np), nrbar = as.integer(qr$nrbar),
d = as.double(qr$d), rbar = as.double(qr$rbar), thetab = as.double(qr$thetab),
tol = as.double(qr$tol), beta = numeric(length(thismodel)),
nreq = as.integer(length(thismodel)), ier = numeric(1))$beta
names(beta) <- object$xnames[qr$vorder[1:qr$n]]
reorder <- order(qr$vorder[1:qr$n])
beta <- beta[reorder]
if (vcov) {
p <- length(thismodel)
R <- diag(qr$np)
R[row(R) > col(R)] <- qr$rbar
R <- t(R)
R <- sqrt(qr$d) * R
R <- R[1:p, 1:p, drop = FALSE]
R <- chol2inv(R)
dimnames(R) <- list(object$xnames[qr$vorder[1:p]],
object$xnames[qr$vorder[1:p]])
V <- R * s$rss[id[i]]/(object$nn - p)
V <- V[reorder, reorder]
attr(beta, "vcov") <- V
}
betas[[i]] <- beta
}
if (length(id) == 1)
beta
else betas
}
Another solution that works for me is to randomize the order of the columns (the independent variables) in your dataset before running regsubsets. The idea is that, after reordering, the highly correlated columns will hopefully be far apart from each other and will not trigger the reordering behaviour in the regsubsets algorithm.
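A minimal sketch of that shuffling idea, using the mtcars columns from the question. The seed and the names `predictors` and `shuffled` are arbitrary choices for the example, and the regsubsets part only runs if leaps is installed. Shuffling is not guaranteed to avoid the problem, so it is still worth comparing summary()$which with coef() afterwards.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

dta <- mtcars[, c("mpg", "cyl", "am", "wt", "hp")]
dta$manual <- as.numeric(!dta$am)  # dummy that is perfectly collinear with am

# Keep the response first and shuffle only the predictor columns
predictors <- setdiff(names(dta), "mpg")
shuffled <- dta[, c("mpg", sample(predictors))]

if (requireNamespace("leaps", quietly = TRUE)) {
  fit <- leaps::regsubsets(mpg ~ ., data = shuffled)
  s <- summary(fit)
  w <- which.min(s$bic)
  print(s$which[w, ])   # variables selected according to summary()
  print(coef(fit, w))   # coefficients: check that the names match the line above
}
```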
I have created the correct number of indices for my vectors and I am trying to use the i'th element from the for loop as the index at which to store the classification error value. But I get the error:
Error in indeces.gen_error[[i]] <- paste(classification_error) :
attempt to select less than one element
Data files: uspscl.txt and uspsdata.txt
My code:
library(e1071)
library(caret)
set.seed(733)
uspscldf = read.table('uspscl.txt', header=F, sep=',')
uspsdatadf = read.table('uspsdata.txt', header=F, sep='\t')
trainIndex <- createDataPartition(uspscldf$V1,list=FALSE, p = .80,times=1)
dataTrain <- uspsdatadf[ trainIndex,]
dataTest <- uspsdatadf[-trainIndex,]
classTrain <- uspscldf[ trainIndex,]
classTest <- uspscldf[-trainIndex,]
indeces = seq(0.00001, 1, by=0.001)
indeces.gen_error = NULL
indeces.softmargin = NULL
for (i in seq(0.00001, 1, by=0.001)){
# For svm(): soft margin is "cost"
# Gaussian kernel bandwidth (sigma) = is implicitly defined by "gamma"
# kernel="radial" is non-linear while kernel="linear" is linear
svm.model <- svm(classTrain ~ ., data = dataTrain, cost = i,type="C-classification",kernel = "linear")
svm.pred <- predict(svm.model, dataTrain)
# confusion matrix
tab <- table(pred = svm.pred, true = classTrain)
classification_error <- 1- sum(svm.pred == classTrain)/length(svm.pred)
indeces.gen_error[[i]] <- paste(classification_error)
indeces.softmargin[[i]]<-i
}
I printed i in the first iteration and it gives 1e-5, which is correct, so I am at a loss as to why it says I am selecting less than one element.
Any help would be appreciated. Thanks
Answer:
I did not see Pierre's answer until after I had solved the problem myself, but his explanation is better, so I am accepting his answer. My new code is:
indeces = seq(0.00001, 1, by=0.001)
indeces.gen_error = NULL
indeces.softmargin = NULL
count=0
for (i in indeces){
count=count+1
# For svm(): soft margin is "cost"
# Gaussian kernel bandwidth (sigma) = is implicitly defined by "gamma"
# kernel="radial" is non-linear while kernel="linear" is linear
svm.model <- svm(classTrain ~ ., data = dataTrain, cost = i,type="C-classification",kernel = "linear")
svm.pred <- predict(svm.model, dataTrain)
# confusion matrix
tab <- table(pred = svm.pred, true = classTrain)
classification_error <- 1- sum(svm.pred == classTrain)/length(svm.pred)
indeces.gen_error[[count]] <- paste(classification_error)
indeces.softmargin[[count]]<-i
}
#Example
x <- NULL
for( i in seq(0.01, 1, .01)) {
a <- 10 * i
x[[i]] <- paste("b", a)
}
# Error in x[[i]] <- paste("b", a) :
# attempt to select less than one element
#The right way
x <- NULL
myseq <- seq(0.01, 1, 0.01)
# Loop over integer positions, not over the decimal values themselves
for (i in seq_along(myseq)) {
a <- 10 * myseq[i]
x[i] <- paste("b", a)
}
Why the first way fails: for (i in seq(0.01, 1, .01)) assigns each value of the sequence to i in turn. Whenever a loop fails, a good first troubleshooting step is to walk through the iterations one at a time, substituting the current value of the sequence wherever there is an i. The first iteration of your loop is equivalent to:
for (i in seq(0.00001, 1, by=0.001)){
svm.model <- svm(classTrain ~ ., data = dataTrain, cost = 0.00001,type="C-classification",kernel = "linear")
svm.pred <- predict(svm.model, dataTrain)
# confusion matrix
tab <- table(pred = svm.pred, true = classTrain)
classification_error <- 1- sum(svm.pred == classTrain)/length(svm.pred)
indeces.gen_error[[0.00001]] <- paste(classification_error)
indeces.softmargin[[0.00001]]<- 0.00001
}
Do you see the problem? Pay attention to what happens with indeces.gen_error[[0.00001]]. You did not mean to do this. You meant for indeces.gen_error[[1]] to be the first entry.
You are subsetting by a decimal. If we had:
x <- 1:10
What do you think would happen with x[2.5]? We are asking R for the element at position 2.5. That doesn't make sense. There is no half position. There is either the 2nd or the 3rd. Try it to see what is returned.
In your loop you are asking R for indeces.gen_error[[0.00001]], that is, the 1/100,000th position. That doesn't make sense either. The evaluator forces the subscript to be an integer, so it is truncated to 0, and we get an error.
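The truncation can be checked directly with single-bracket subsetting, which coerces a decimal index the same way as.integer() does. The name `cost.grid` below is only an illustrative stand-in for the sequence of cost values in the question.

```r
x <- 1:10

x[2.5]               # the index 2.5 is truncated to 2, so this returns x[2]
x[c(1.2, 1.9)]       # both indices truncate to 1
x[0.00001]           # the index truncates to 0, which selects nothing
as.integer(0.00001)  # 0

# Looping over integer positions instead of the decimal values themselves
cost.grid <- seq(0.00001, 1, by = 0.001)
head(seq_along(cost.grid))  # 1 2 3 4 5 6
```

Indexing the result vectors by `seq_along(cost.grid)` and reading the cost as `cost.grid[i]` keeps the stored positions at 1, 2, 3, ... regardless of the cost values.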