weird svm behavior in R (e1071) - r

I ran the following code for a binary classification task w/ an SVM in both R (first sample) and Python (second example).
Given randomly generated data (X) and response (Y), this code performs leave group out cross validation 1000 times. Each entry of Y is therefore the mean of the prediction across CV iterations.
Computing area under the curve should give ~0.5, since X and Y are completely random. However, this is not what we see. Area under the curve is frequently significantly higher than 0.5. The number of rows of X is very small, which can obviously cause problems.
Any idea what could be happening here? I know that I can either increase the number of rows of X or decrease the number of columns to mediate the problem, but I am looking for other issues.
Y=as.factor(rep(c(1,2), times=14))
X=matrix(runif(length(Y)*100), nrow=length(Y))
library(e1071)
library(pROC)
colnames(X)=1:ncol(X)
iter=1000
ansMat=matrix(NA,length(Y),iter)
for(i in seq(iter)){
#get train
train=sample(seq(length(Y)),0.5*length(Y))
if(min(table(Y[train]))==0)
next
#test from train
test=seq(length(Y))[-train]
#train model
XX=X[train,]
YY=Y[train]
mod=svm(XX,YY,probability=FALSE)
XXX=X[test,]
predVec=predict(mod,XXX)
RFans=attr(predVec,'decision.values')
ansMat[test,i]=as.numeric(predVec)
}
ans=rowMeans(ansMat,na.rm=TRUE)
r=roc(Y,ans)$auc
print(r)
Similarly, when I implement the same thing in Python I get similar results.
Y = np.array([1, 2]*14)
X = np.random.uniform(size=[len(Y), 100])
n_iter = 1000
ansMat = np.full((len(Y), n_iter), np.nan)
for i in range(n_iter):
# Get train/test index
train = np.random.choice(range(len(Y)), size=int(0.5*len(Y)), replace=False, p=None)
if len(np.unique(Y)) == 1:
continue
test = np.array([i for i in range(len(Y)) if i not in train])
# train model
mod = SVC(probability=False)
mod.fit(X=X[train, :], y=Y[train])
# predict and collect answer
ansMat[test, i] = mod.predict(X[test, :])
ans = np.nanmean(ansMat, axis=1)
fpr, tpr, thresholds = roc_curve(Y, ans, pos_label=1)
print(auc(fpr, tpr))`

You should consider each iteration of cross-validation to be an independent experiment, where you train using the training set, test using the testing set, and then calculate the model skill score (in this case, AUC).
So what you should actually do is calculate the AUC for each CV iteration. And then take the mean of the AUCs.

Related

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda<-plsda(X,Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validatie[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this seperately for 10 times, but I want do that a little faster. Therefore I was thinking of making a list with the results of prediction for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER function. But then I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without reproducible there is no way to test this but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
prediction <- myPredictions$class$max.dist[,x] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validatie[,1], predicted = prediction)
get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of PCs so results[[1]] will be the results for 10 PCs. You will not get values for prediction or confusionmat unless they are included in the results returned by get.BER. If you want all of that, you need to replace the last line to the function with return(list(prediction, confusionmat, get.BER(confusion.mat)). This will produce a list of the lists so that results[[1]][[1]] will be the results of prediction for 10 PCs and results[[1]][[2]] and results[[1]][[3]] will be confusionmat and get.BER(confusion.mat) respectively.

Standardization and inclusion of intercept in sparse lasso GLM

I found some problems while practicing the sparse group lasso method using the cvSGL function forom the SGL package.
My questions are as follows:
Looking at the code for SGL:::center_scale, it doesn't seem to consider the sample size of the data.
SGL:::center_scale
#Output
function (X, standardize) {
means <- apply(X, 2, mean)
X <- t(t(X) - means)
X.transform <- list(X.means = means)
if (standardize == TRUE) {
var <- apply(X, 2, function(x) (sqrt(sum(x^2))))
X <- t(t(X)/var)
X.transform$X.scale <- var
}
else {
X.transform$X.scale <- 1
}
return(list(x = X, X.transform = X.transform))
}
Therefore, the sample standard deviation of the predicted variable is measured somewhat larger.
Is my understanding correct that this may cause the coefficients to be excessively large?
whether the model can be estimated by SGL package with intercept term (or constant term)
The SGL package does not seem to provide a function for estimating by including a intercept term.
In cvFit[["fit"]], I can see only the beta of the predict variables according to the lambda's except for the constant term. The value of cvFit[["fit"]][["intercept"]] is the mean of the y variable.
It can be estimated by adding 1 to first column of predict variable X, but in this case, it is expected to cause problems in centering and standardizing predict variables.
In addition, the SPG package seems to add a penalty to all predict variables. Even if the estimation is performed by adding 1 to the first column of the explanatory variable X as described above, the constant term may be estimated as 0.

Using anova() on gamma distributions gives seemingly random p-values

I am trying to determine whether there is a significant difference between two Gamm distributions. One distribution has (shape, scale)=(shapeRef,scaleRef) while the other has (shape, scale)=(shapeTarget,scaleTarget). I try to do analysis of variance with the following code
n=10000
x=rgamma(n, shape=shapeRef, scale=scaleRef)
y=rgamma(n, shape=shapeTarget, scale=scaleTarget)
glmm1 <- gam(y~x,family=Gamma(link=log))
anova(glmm1)
The resulting p values keep changing and can be anywhere from <0.1 to >0.9.
Am I going about this the wrong way?
Edit: I use the following code instead
f <- gl(2, n)
x=rgamma(n, shape=shapeRef, scale=scaleRef)
y=rgamma(n, shape=shapeTarget, scale=scaleTarget)
xy <- c(x, y)
anova(glm(xy ~ f, family = Gamma(link = log)),test="F")
But, every time I run it I get a different p-value.
You will indeed get a different p-value every time you run this, if you pick different realizations every time. Just like your data values are random variables, which you'd expect to vary each time you ran an experiment, so is the p-value. If the null hypothesis is true (which was the case in your initial attempts), then the p-values will be uniformly distributed between 0 and 1.
Function to generate simulated data:
simfun <- function(n=100,shapeRef=2,shapeTarget=2,
scaleRef=1,scaleTarget=2) {
f <- gl(2, n)
x=rgamma(n, shape=shapeRef, scale=scaleRef)
y=rgamma(n, shape=shapeTarget, scale=scaleTarget)
xy <- c(x, y)
data.frame(xy,f)
}
Function to run anova() and extract the p-value:
sumfun <- function(d) {
aa <- anova(glm(xy ~ f, family = Gamma(link = log),data=d),test="F")
aa["f","Pr(>F)"]
}
Try it out, 500 times:
set.seed(101)
r <- replicate(500,sumfun(simfun()))
The p-values are always very small (the difference in scale parameters is easily distinguishable), but they do vary:
par(las=1,bty="l") ## cosmetic
hist(log10(r),col="gray",breaks=50)

Feature selection + cross-validation, but how to make ROC-curves in R

I'm stuck with the next problem. I divide my data into 10 folds. Each time, I use 1 fold as test set and the other 9 as training set (I do this ten times). On each training set, I do feature selection (filter methode with chi.squared) and then I make a SVMmodel with my training set and the selected features.
So at the end, I become 10 different models (because of the feature selection). But now I want to make a ROC-curve in R from this filter methode in general. How can I do this?
Silke
You can indeed store the predictions if they are all on the same scale (be especially careful about this as you perform feature selection... some methods may produce scores that are dependent on the number of features) and use them to build a ROC curve. Here is the code I used for a recent paper:
library(pROC)
data(aSAH)
k <- 10
n <- dim(aSAH)[1]
indices <- sample(rep(1:k, ceiling(n/k))[1:n])
all.response <- all.predictor <- aucs <- c()
for (i in 1:k) {
test = aSAH[indices==i,]
learn = aSAH[indices!=i,]
model <- glm(as.numeric(outcome)-1 ~ s100b + ndka + as.numeric(wfns), data = learn, family=binomial(link = "logit"))
model.pred <- predict(model, newdata=test)
aucs <- c(aucs, roc(test$outcome, model.pred)$auc)
all.response <- c(all.response, test$outcome)
all.predictor <- c(all.predictor, model.pred)
}
roc(all.response, all.predictor)
mean(aucs)
The roc curve is built from all.response and all.predictor that are updated at each step. This code also stores the AUC at each step in auc for comparison. Both results should be quite similar when the sample size is sufficiently large, however small samples within the cross-validation may lead to underestimated AUC as the ROC curve with all data will tend to be smoother and less underestimated by the trapezoidal rule.

Bootstrapping to compare two groups

In the following code I use bootstrapping to calculate the C.I. and the p-value under the null hypothesis that two different fertilizers applied to tomato plants have no effect in plants yields (and the alternative being that the "improved" fertilizer is better). The first random sample (x) comes from plants where a standard fertilizer has been used, while an "improved" one has been used in the plants where the second sample (y) comes from.
x <- c(11.4,25.3,29.9,16.5,21.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
library(boot)
diff <- function(x,i) mean(x[i[6:11]]) - mean(x[i[1:5]])
b <- boot(total, diff, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
What I don't like about the code above is that resampling is done as if there was only one sample of 11 values (separating the first 5 as belonging to sample x leaving the rest to sample y).
Could you show me how this code should be modified in order to draw resamples of size 5 with replacement from the first sample and separate resamples of size 6 from the second sample, so that bootstrap resampling would mimic the “separate samples” design that produced the original data?
EDIT2 :
Hack deleted as it was a wrong solution. Instead one has to use the argument strata of the boot function :
total <- c(x,y)
id <- as.factor(c(rep("x",length(x)),rep("y",length(y))))
b <- boot(total, diff, strata=id, R = 10000)
...
Be aware you're not going to get even close to a correct estimate of your p.value :
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
total <- c(x,y)
b <- boot(total, diff, strata=id, R = 10000)
ci <- boot.ci(b)
p.value <- sum(b$t>=b$t0)/b$R
> p.value
[1] 0.5162
How would you explain a p-value of 0.51 for two samples where all values of the second are higher than the highest value of the first?
The above code is fine to get a -biased- estimate of the confidence interval, but the significance testing about the difference should be done by permutation over the complete dataset.
Following John, I think the appropriate way to use bootstrap to test if the sums of these two different populations are significantly different is as follows:
x <- c(1.4,2.3,2.9,1.5,1.1)
y <- c(23.7,26.6,28.5,14.2,17.9,24.3)
b_x <- boot(x, sum, R = 10000)
b_y <- boot(y, sum, R = 10000)
z<-(b_x$t0-b_y$t0)/sqrt(var(b_x$t[,1])+var(b_y$t[,1]))
pnorm(z)
So we can clearly reject the null that they are the same population. I may have missed a degree of freedom adjustment, I am not sure how bootstrapping works in that regard, but such an adjustment will not change your results drastically.
While the actual soil beds could be considered a stratified variable in some instances this is not one of them. You only have the one manipulation, between the groups of plants. Therefore, your null hypothesis is that they really do come from the exact same population. Treating the items as if they're from a single set of 11 samples is the correct way to bootstrap in this case.
If you have two plots, and in each plot tried the different fertilizers over different seasons in a counterbalanced fashion then the plots would be statified samples and you'd want to treat them as such. But that isn't the case here.

Resources