Feature selection + cross-validation, but how to make ROC curves in R

I'm stuck on the following problem. I divide my data into 10 folds. Each time, I use 1 fold as the test set and the other 9 as the training set (I do this ten times). On each training set, I do feature selection (a filter method with chi.squared) and then build an SVM model with the training set and the selected features.
So at the end, I end up with 10 different models (because of the feature selection). But now I want to make a ROC curve in R for this filter method in general. How can I do this?

You can indeed store the predictions, provided they are all on the same scale (be especially careful about this when you perform feature selection: some methods produce scores that depend on the number of features), and use them to build a ROC curve. Here is the code I used for a recent paper:
library(pROC)
data(aSAH)

k <- 10
n <- dim(aSAH)[1]
# Assign each observation to one of k folds
indices <- sample(rep(1:k, ceiling(n/k))[1:n])

all.response <- all.predictor <- aucs <- c()
for (i in 1:k) {
  test <- aSAH[indices == i, ]
  learn <- aSAH[indices != i, ]
  model <- glm(as.numeric(outcome) - 1 ~ s100b + ndka + as.numeric(wfns),
               data = learn, family = binomial(link = "logit"))
  model.pred <- predict(model, newdata = test)
  # per-fold AUC
  aucs <- c(aucs, roc(test$outcome, model.pred)$auc)
  # pool responses and predictions across folds
  all.response <- c(all.response, test$outcome)
  all.predictor <- c(all.predictor, model.pred)
}
roc(all.response, all.predictor)
mean(aucs)
The ROC curve is built from all.response and all.predictor, which are updated at each step. The code also stores the AUC of each fold in aucs for comparison. Both results should be quite similar when the sample size is sufficiently large; however, small samples within the cross-validation may lead to underestimated per-fold AUCs, as the ROC curve built from all the data tends to be smoother and therefore less underestimated by the trapezoidal rule.

Related

How to apply a long list of functions automatically in 10 imputed datasets in R

I have 10 datasets that are the result of multiple imputation, which I named: data1, data2, ..., data10. For each of them, I want to:
Create a logistic regression model
Do multiple steps, which include creating a LASSO model, resampling 200 times from my imputed dataset, recreating the LASSO model in each resample, and evaluating measures of performance.
I'm able to do it separately for each dataset, but I was wondering if there was a way to do all of the steps for each imputed dataset automatically. Below, I included an example of all the steps I do to get results separately for each imputation.
To do it automatically, I first thought about using lapply to create the regressions for every imputation:
log01.1 <- lapply(paste0("data",1:10), function(x){lrm(y ~ x1 + x2 + x3, data=eval(parse(text = x)), x=T, y=T)})
Then I wanted to use lapply again on the whole block of code below, with something like:
lapply(log01.1, function(x){ ...all the steps following the regression... })
But I realized that doesn't work, since lapply can only be applied to one function at a time (as I understand it). Also, at model.L1 <- glmnet(x=log01.1$x, y=log01.1$y, alpha=1, lambda=cv.glmmod$lambda.1se, family="binomial") it wouldn't work, since my lambda would come from a list, and I can't use lapply both on log01.1 and on cv.glmmod at the same time. Add to that the resampling with the 200 repetitions, and I'm sure I would run into other problems I can't even think of right now.
And that's about the extent of my knowledge of lapply and other functions that could do similar things. Is there a way to take the chunk of code I wrote below and tell R to repeat it for every one of my 10 imputations, and then store the objects that get created in separate lists? Or maybe not in lists, but so that I would get, for example, App1, App2, App3, etc.?
Or am I better off just repeating it 10 times and storing the results?
library(rms)     # provides lrm and lrm.fit
library(glmnet)  # provides glmnet and cv.glmnet
log01.1 <- lrm(y ~ x1 + x2 + x3, data=data1, x=T, y=T)
reps <- 200; App <- numeric(reps); Test <- numeric(reps)
for(i in 1:reps){
  #1. Construct LASSO model in sample i
  cv.glmmod <- cv.glmnet(x=log01.1$x, y=log01.1$y, alpha=1, family="binomial")
  model.L1 <- glmnet(x=log01.1$x, y=log01.1$y, alpha=1,
                     lambda=cv.glmmod$lambda.1se, family="binomial") #use optimum penalty
  lp1 <- log01.1$x %*% model.L1$beta #for apparent performance
  #2. Draw bootstrap sample with replacement from sample i
  j <- sample(nrow(data1), replace=T) #for sample Bi
  #3. Construct a LASSO model in sample Bi, replaying every step done in imputed
  #sample i (especially model specification steps such as selection of predictors),
  #and determine the bootstrap performance as the apparent performance in Bi
  cv.j <- cv.glmnet(x=log01.1$x[j,], y=log01.1$y[j], alpha=1, family="binomial")
  model.L1j <- glmnet(x=log01.1$x[j,], y=log01.1$y[j], alpha=1,
                      lambda=cv.j$lambda.1se, family="binomial") #use optimum penalty for Bi
  lp1j <- log01.1$x[j,] %*% model.L1j$beta #apparent performance in Bi
  App[i] <- lrm.fit(y=log01.1$y[j], x=lp1j)$stats[6] #apparent c for Bi
  #4. Apply the model from Bi to the original sample i without any modification
  #to determine the test performance
  lp1 <- log01.1$x %*% model.L1j$beta #validated performance in i
  Test[i] <- lrm.fit(y=log01.1$y, x=lp1)$stats[6] #test c in i
}
That is the code I would like to repeat automatically for every imputed set.
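One way to avoid copying the block ten times, as a sketch only (not from the original thread; the names datasets and run_one are mine, and it assumes the imputed data frames can be gathered into a list): wrap the whole block in a function and lapply it over the list, so each imputation gets its own entry in the result list.
library(rms)     # lrm, lrm.fit
library(glmnet)  # glmnet, cv.glmnet

datasets <- list(data1, data2)  # ..., data10

run_one <- function(dat, reps = 200) {
  fit <- lrm(y ~ x1 + x2 + x3, data = dat, x = TRUE, y = TRUE)
  App <- Test <- numeric(reps)
  for (i in 1:reps) {
    # bootstrap sample Bi from this imputed data set
    j <- sample(nrow(dat), replace = TRUE)
    # replay the LASSO modelling steps in Bi
    cv.j <- cv.glmnet(x = fit$x[j, ], y = fit$y[j], alpha = 1, family = "binomial")
    m.j <- glmnet(x = fit$x[j, ], y = fit$y[j], alpha = 1,
                  lambda = cv.j$lambda.1se, family = "binomial")
    # apparent c-index in Bi ($stats[6] is the c-index, as in the original code)
    App[i] <- lrm.fit(y = fit$y[j], x = as.numeric(fit$x[j, ] %*% m.j$beta))$stats[6]
    # test c-index: apply the Bi model to the full imputed sample
    Test[i] <- lrm.fit(y = fit$y, x = as.numeric(fit$x %*% m.j$beta))$stats[6]
  }
  list(fit = fit, App = App, Test = Test)
}

results <- lapply(datasets, run_one)
# e.g. results[[1]]$App, results[[3]]$Test, ...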

weird svm behavior in R (e1071)

I ran the following code for a binary classification task with an SVM, in both R (first example) and Python (second example).
Given randomly generated data (X) and response (Y), this code performs leave-group-out cross-validation 1000 times. Each observation's final prediction is therefore the mean of its predictions across the CV iterations in which it fell into the test set.
Computing the area under the curve should give ~0.5, since X and Y are completely random. However, this is not what we see: the area under the curve is frequently significantly higher than 0.5. The number of rows of X is very small, which can obviously cause problems.
Any idea what could be happening here? I know that I can either increase the number of rows of X or decrease the number of columns to mitigate the problem, but I am looking for other issues.
library(e1071)
library(pROC)

Y <- as.factor(rep(c(1,2), times=14))
X <- matrix(runif(length(Y)*100), nrow=length(Y))
colnames(X) <- 1:ncol(X)

iter <- 1000
ansMat <- matrix(NA, length(Y), iter)
for(i in seq(iter)){
  # get train indices; skip iterations where a class is missing from the train set
  train <- sample(seq(length(Y)), 0.5*length(Y))
  if(min(table(Y[train])) == 0)
    next
  # test = everything not in train
  test <- seq(length(Y))[-train]
  # train model
  XX <- X[train,]
  YY <- Y[train]
  mod <- svm(XX, YY, probability=FALSE)
  XXX <- X[test,]
  predVec <- predict(mod, XXX)
  RFans <- attr(predVec, 'decision.values') # NULL here: decision.values was not requested
  ansMat[test,i] <- as.numeric(predVec)
}
ans <- rowMeans(ansMat, na.rm=TRUE)
r <- roc(Y, ans)$auc
print(r)
When I implement the same thing in Python, I get similar results.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)

for i in range(n_iter):
    # Get train/test index
    train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
    # skip iterations where a class is missing from the train set
    if len(np.unique(Y[train])) == 1:
        continue
    test = np.array([j for j in range(len(Y)) if j not in train])
    # train model
    mod = SVC(probability=False)
    mod.fit(X=X[train, :], y=Y[train])
    # predict and collect answer
    ansMat[test, i] = mod.predict(X[test, :])

ans = np.nanmean(ansMat, axis=1)
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
print(auc(fpr, tpr))
You should consider each iteration of cross-validation an independent experiment: train with the training set, test with the testing set, and then compute the model skill score (in this case, AUC).
So what you should actually do is compute the AUC for each CV iteration and then take the mean of those AUCs.
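A minimal sketch of that per-iteration approach for the R code above (the aucs vector and the quiet flag are my additions, not from the original answer; it reuses X, Y and iter from the question):
aucs <- rep(NA_real_, iter)
for(i in seq(iter)){
  train <- sample(seq(length(Y)), 0.5*length(Y))
  if(min(table(Y[train])) == 0)
    next
  test <- seq(length(Y))[-train]
  mod <- svm(X[train,], Y[train], probability=FALSE)
  # class predictions, as in the question's code
  pred <- as.numeric(predict(mod, X[test,]))
  aucs[i] <- roc(Y[test], pred, quiet=TRUE)$auc
}
mean(aucs, na.rm=TRUE)  # hovers around 0.5 for truly random data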

How to estimate the lambdas of Poisson-distributed samples in R, and draw a kernel estimate of the estimator's density function based on them?

So I have 500 Poisson-distributed simulated samples with n=100 each.
1) How can I estimate the lambdas for each of these samples separately in R?
2) How can I draw a kernel estimate of the density function of the estimator for lambda, based on the 500 estimated lambdas? (My guess is to somehow use the "KernSmooth" package and the function "bkfe", but I fail to program it properly anyway.)
taskpois <- function(size, leng){
  taskmlepois <- NULL
  for (i in 1:leng){
    randompois <- rpois(size, 6)
    taskmlepois[i] <- mean(randompois)
  }
  return(taskmlepois)
}
tasksample <- taskpois(size=100, leng=500)
As the comments suggest, it seems you're pretty close already.
ltarget <- 2
set.seed(101)
# the MLE of lambda is the sample mean; 500 replicates of samples of size 100
lambdavec <- replicate(500, mean(rpois(100, lambda=ltarget)))
dd <- density(lambdavec)
plot(dd, main="", las=1, bty="l")
We might as well add the expected result based on asymptotic theory:
curve(dnorm(x, mean=2, sd=sqrt(2/100)), add=TRUE, col=2)
We can add another line that shows that the variation among the densities of different experiments is pretty large relative to the difference between the theoretical and observed density from the first experiment:
lambdavec2 <- replicate(500, mean(rpois(100, lambda=ltarget)))
lines(density(lambdavec2), col=4)
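If you specifically want the KernSmooth package mentioned in the question, a sketch (note that bkde, the binned kernel density estimate, is the function you want here; bkfe estimates density functionals):
library(KernSmooth)
kd <- bkde(lambdavec)  # binned kernel density estimate of the 500 lambda estimates
lines(kd, col=3)       # overlay on the density() plot above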

how to use previous observations to forecast the next period using for loops in R?

I have generated 1000 observations from the AR(2) model x_t = γ1*x_{t−1} + γ2*x_{t−2} + ε_t.
What I would like to do is to use the first 900 observations to estimate the model, and use the remaining 100 observations to predict one-step ahead.
This is what I have done so far:
data2 <- arima.sim(n=1000, list(ar=c(0.5, -0.7)))  # simulate 1000 observations from an AR(2)
arima(data2, order=c(2,0,0), method="ML")          # estimate the model parameters by ML
fit2 <- arima(data2[1:900], c(2,0,0), method="ML") # fit on the first 900 observations only
predict(fit2, 100)
But the problem with my code right now is that it uses n.ahead=100, whereas I would like to use n.ahead=1 and make 100 predictions in total.
I think I need a for loop for this, but since I am a very new user of RStudio I haven't been able to figure out how to use for loops to make the predictions. Can anyone help me with this?
If I've understood you correctly, you want one-step predictions on the test set. This should do what you want without loops:
library(forecast)
data2 <- arima.sim(n=1000, list(ar=c(0.5, -0.7)))
fit2 <- Arima(data2[1:900], c(2,0,0), method="ML")
fit2a <- Arima(data2[901:1000], model=fit2)
fc <- fitted(fit2a)
The Arima command allows a model to be applied to a new data set without the parameters being re-estimated. Then fitted gives one-step in-sample forecasts.
If you want multi-step forecasts on the test data, you will need to use a loop. Here is an example for two-step ahead forecasts:
fcloop <- numeric(100)
h <- 2
for(i in 1:100)
{
  fit2a <- Arima(data2[1:(899+i)], model=fit2)
  fcloop[i] <- forecast(fit2a, h=h)$mean[h]
}
If you set h <- 1 above you will get almost the same results as using fitted in the previous block of code. The first two values will be different because the approach using fitted does not take account of the data at the end of the training set, while the approach using the loop uses the end of the training set when making the forecasts.
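For a quick check, a sketch reusing the objects above (fit2b is my own name, so the fit2a from the loop is not clobbered) that compares the one-step loop with the fitted() approach:
fc1 <- numeric(100)
for(i in 1:100)
{
  fit2b <- Arima(data2[1:(899+i)], model=fit2)
  fc1[i] <- forecast(fit2b, h=1)$mean[1]
}
cbind(fitted=fc[1:5], loop=fc1[1:5])  # only the first couple of values should differ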

plot multiple fit and predictions for logistic regression

I am running a logistic regression multiple times over more than 1000 samples taken from a dataset. My question is: what is the best way to show my results? How can I plot my outputs for both the fit and the prediction curve?
This is an example of what I am doing, using the baseball dataset from R. For example, I want to fit and predict the model 5 times. Each time I take one sample out (for the prediction) and use another for the fit.
library(corrgram)
data(baseball)
#Exclude rows with NA values
dataset=baseball[complete.cases(baseball),]
#Create a vector replacing the League (A or N) by 1 or 0
PA=rep(0,dim(dataset)[1])
PA[which(dataset[,2]=="A")]=1
#Model whether the player is in league A as a function of Hits, Runs, Errors and Salary
fit_glm_list=list()
prd_glm_list=list()
for (k in 1:5){
  sp=sample(seq(1:length(PA)),30,replace=FALSE)
  fit_glm<-glm(PA[sp[1:15]]~baseball$Hits[sp[1:15]]+baseball$Runs[sp[1:15]]+baseball$Errors[sp[1:15]]+baseball$Salary[sp[1:15]])
  prd_glm<-predict(fit_glm,baseball[sp[16:30],c(6,8,20,21)])
  fit_glm_list[[k]]=fit_glm;prd_glm_list[[k]]=fit_glm
}
There are a number of issues here:
1. PA is a subset of baseball$League, but the model is constructed on columns of the whole baseball data frame, i.e. they do not match.
2. PA is treated as a continuous response when using the default family (gaussian); it should be changed to a factor, with the binomial family.
3. prd_glm_list[[k]]=fit_glm should probably be prd_glm_list[[k]]=prd_glm.
4. You must save the true class labels for the predictions, otherwise you have nothing to compare to.
My take on your code looks like this.
library(corrgram)
data(baseball)
dataset <- baseball[complete.cases(baseball),]
fits <- preds <- truths <- vector("list", 5)
for (k in 1:5){
  sp <- sample(nrow(dataset), 30, replace=FALSE)
  fits[[k]] <- glm(League ~ Hits + Runs + Errors + Salary,
                   family="binomial", data=dataset[sp[1:15],])
  preds[[k]] <- predict(fits[[k]], dataset[sp[16:30],], type="response")
  # save the labels of the rows we predicted on, not the training rows
  truths[[k]] <- dataset$League[sp[16:30]]
}
plot(unlist(truths), unlist(preds))
The model performs poorly, but at least the code runs without problems. The y-axis in the plot shows the estimated probabilities that the examples belong to league N; ideally, the left box should be close to 0 and the right box close to 1.
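If you also want a single summary of discrimination in the spirit of the pooled ROC curve from the first answer, a sketch (it assumes the corrected truths above, so the labels match the predicted rows):
library(pROC)
# pool the true labels and predicted probabilities across the five repeats
roc(unlist(truths), unlist(preds))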
