Change threshold for class prediction in R - r

How can I change the probability threshold used to predict a class as 1 in R?
In RapidMiner there is an Apply Threshold operator. How can I achieve the same thing in R?
library(e1071)  # provides svm()
svm_model1 <- svm(x, y, probability = TRUE)
summary(svm_model1)
pred <- predict(svm_model1, x, probability = TRUE)

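With probability = TRUE the prediction carries the class probabilities (in e1071 they are attached to pred as the "probabilities" attribute). For binary classification, you only need to compare the probability of class 1 with your threshold.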
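As a minimal sketch of that comparison in base R: the helper name apply_threshold and the probability values below are made up for illustration. With an e1071 model, the probabilities would presumably come from attr(pred, "probabilities")[, "1"], assuming the positive class is labelled "1".

```r
# Hypothetical helper: turn class-1 probabilities into 0/1 labels.
apply_threshold <- function(prob, threshold = 0.5) {
  as.integer(prob >= threshold)
}

# Made-up probabilities standing in for the model output
prob <- c(0.10, 0.45, 0.55, 0.80, 0.95)

default_labels <- apply_threshold(prob)        # cut at 0.5
strict_labels  <- apply_threshold(prob, 0.9)   # cut at 0.9

default_labels
strict_labels
```

Raising the threshold makes the classifier more conservative about predicting class 1, which is the same trade-off the RapidMiner operator exposes.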

Related

Why do the results of the bootstrapping methods differ when the same seed is used?

I want to generate 95% confidence intervals for the R2 of a linear model. While developing the code, and using the same seed for both approaches, I found that bootstrapping manually doesn't give me the same results as using the boot function from the boot package. Am I doing something wrong, or why is this happening?
In addition, to calculate the 95% CI I tried to use the confint function, but I get the error "$ operator is invalid for atomic vectors". Is there a way to avoid this error?
Here is a reproducible example that illustrates my concerns:
#creating the dataframe
a <- rpois(n = 100, lambda = 10)
b <- rnorm(n = 100, mean = 5, sd = 1)
DF<- data.frame(a,b)
#bootstrapping manually
set.seed(123)
x=length(DF$a)
B_manually<- data.frame(replicate(100, summary(lm(a~b, data = DF[sample(x, replace = T),]))$r.squared))
names(B_manually)[1]<- "r_squared"
#Bootstrapping using the function "boot" from the boot library
set.seed(123)
library(boot)
B_boot <- boot(DF, function(data,indices)
summary(lm(a~b, data[indices,]))$r.squared,R=100)
head(B_manually) == head(B_boot$t)
r_squared
1 FALSE
2 FALSE
3 FALSE
4 FALSE
5 FALSE
6 FALSE
#Why do the results of the manual vs. boot function approach differ if I'm using the same seed?
# 2nd question (using the confint function to determine the 95% CI gives me an error)
confint(B_manually$r_squared, level = 0.95, method = "quantile")
confint(B_boot$t, level = 0.95, method = "quantile")
#Error: $ operator is invalid for atomic vectors
#NOTE: I already used boot.ci to determine the 95% confidence interval, as well as the
#quantile function, but the resulting intervals differ from each other,
#so I just wanted to compare them with the confint function.
quantile(B_boot$t, c(0.025,0.975))
boot.ci(B_boot, index=1,type="perc")
Thanks in advance for any help!
The boot package does not use replicate with sample to generate the indices. Check the index-generating helpers in the source code for boot (ordinary.array for the default sim = "ordinary", as far as I can tell). They basically generate all the indices in one go, so there's no reason to assume that you will end up with the same indices or the same result. Taking a step back: the purpose of the bootstrap is to use random resampling to obtain an estimate of your parameters, so different implementations of the bootstrap should give you similar, not identical, estimates.
For example, you can see that the distribution of R^2 is very similar:
set.seed(111)
a <- rpois(n = 100, lambda = 10)
b <- rnorm(n = 100, mean = 5, sd = 1)
DF<- data.frame(a,b)
set.seed(123)
x=length(DF$a)
B_manually<- data.frame(replicate(999, summary(lm(a~b, data = DF[sample(x, replace = T),]))$r.squared))
library(boot)
B_boot <- boot(DF, function(data,indices)
summary(lm(a~b, data[indices,]))$r.squared,R=999)
par(mfrow=c(2,1))
hist(B_manually[,1],breaks=seq(0,0.4,0.01),main="dist of R2 manual")
hist(B_boot$t,breaks=seq(0,0.4,0.01),main="dist of R2 boot")
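To see why the same seed need not give the same resamples, here is a small base-R sketch. It assumes, as an illustration and not as a description of boot's exact internals, that one implementation draws the indices replicate by replicate while the other draws all n*R indices in a single call and lays them out as an R-by-n matrix. The names n, R, per_replicate and all_at_once are made up for the example.

```r
n <- 5   # hypothetical sample size
R <- 3   # hypothetical number of bootstrap replicates

# Manual approach: one call to sample() per replicate; row r holds replicate r
set.seed(123)
per_replicate <- t(replicate(R, sample(n, replace = TRUE)))

# Assumed "all at once" approach: a single call, then reshaped to R rows
set.seed(123)
all_at_once <- matrix(sample.int(n, n * R, replace = TRUE), nrow = R, ncol = n)

per_replicate
all_at_once

# Do the two approaches hand the same indices to each replicate?
identical(per_replicate, all_at_once)
```

Because the draws are assigned to replicates in a different order, the rows generally differ even though the seed is the same, and so do the statistics computed from them.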
The confint function you are using is meant for fitted model objects such as lm, and estimates a confidence interval for the model coefficients; see its help page. For lm it takes the standard error of each coefficient and multiplies it by the critical t-value to get the interval around the estimate. You can check out this book page for the formula. The objects from your bootstrapping are not lm objects, so this function doesn't work on them. It is not meant for any other estimates.
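A short sketch of the distinction, using the same simulated data as above: confint is applied to the fitted lm object, while the interval for R^2 is taken directly from the bootstrap replicates with quantile (a percentile interval). The variable names fit, r2_boot and r2_ci are illustrative only.

```r
set.seed(111)
a <- rpois(n = 100, lambda = 10)
b <- rnorm(n = 100, mean = 5, sd = 1)
DF <- data.frame(a, b)

# confint works on the model object and returns intervals for the coefficients
fit <- lm(a ~ b, data = DF)
coef_ci <- confint(fit, level = 0.95)

# For R^2, take the percentile interval from the bootstrap replicates instead
set.seed(123)
r2_boot <- replicate(
  999,
  summary(lm(a ~ b, data = DF[sample(nrow(DF), replace = TRUE), ]))$r.squared
)
r2_ci <- quantile(r2_boot, c(0.025, 0.975))

coef_ci
r2_ci
```

The percentile interval from quantile should be close to what boot.ci(..., type = "perc") reports for the same replicates, although the two may use slightly different quantile conventions.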

How to use lapply with get.confusion_matrix() in R?

I am performing a PLS-DA analysis in R using the mixOmics package. I have one binary Y variable (presence or absence of wetland) and 21 continuous predictor variables (X) with values ranging from 1 to 100.
I have made the model with the data_training dataset and want to predict new outcomes with the data_validation dataset. These datasets have exactly the same structure.
My code looks like:
library(mixOmics)
model.plsda<-plsda(X,Y, ncomp = 10)
myPredictions <- predict(model.plsda, newdata = data_validation[,-1], dist = "max.dist")
I want to predict the outcome based on 10, 9, 8, ... to 2 principal components. By using the get.confusion_matrix function, I want to estimate the error rate for every number of principal components.
prediction <- myPredictions$class$max.dist[,10] #prediction based on 10 components
confusion.mat = get.confusion_matrix(truth = data_validatie[,1], predicted = prediction)
get.BER(confusion.mat)
I can do this separately 10 times, but I want to do it a little faster. Therefore I was thinking of making a list with the prediction results for every number of components...
library(BBmisc)
prediction_test <- myPredictions$class$max.dist
predictions_components <- convertColsToList(prediction_test, name.list = T, name.vector = T, factors.as.char = T)
...and then using lapply with the get.confusion_matrix and get.BER function. But then I don't know how to do that. I have searched on the internet, but I can't find a solution that works. How can I do this?
Many thanks for your help!
Without a reproducible example there is no way to test this, but you need to convert the code you want to run each time into a function. Something like this:
confmat <- function(x) {
  prediction <- myPredictions$class$max.dist[,x] #prediction based on x components
  confusion.mat = get.confusion_matrix(truth = data_validatie[,1], predicted = prediction)
  get.BER(confusion.mat)
}
Now lapply:
results <- lapply(10:2, confmat)
That will return a list with the get.BER results for each number of PCs, so results[[1]] will be the results for 10 PCs. You will not get values for prediction or confusion.mat unless they are included in the results returned by get.BER. If you want all of that, you need to replace the last line of the function with return(list(prediction, confusion.mat, get.BER(confusion.mat))). This will produce a list of lists, so that results[[1]][[1]] will be the prediction for 10 PCs, and results[[1]][[2]] and results[[1]][[3]] will be confusion.mat and get.BER(confusion.mat) respectively.
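The same pattern can be tried without mixOmics. The sketch below is a base-R stand-in, not the mixOmics API: the truth vector and the prediction matrix are made up, the confusion matrix is built with table, and the balanced error rate is computed by hand as the mean of the per-class error rates.

```r
# Made-up truth and predictions (one column per number of components)
truth <- factor(c("a", "a", "a", "b", "b", "b"))
pred_by_comp <- cbind(
  comp1 = c("a", "a", "b", "b", "b", "a"),
  comp2 = c("a", "a", "a", "b", "b", "a"),
  comp3 = c("a", "a", "a", "b", "b", "b")
)

# Hypothetical helper: everything that should be repeated goes in one function
ber_for <- function(k) {
  predicted <- factor(pred_by_comp[, k], levels = levels(truth))
  cm <- table(truth = truth, predicted = predicted)
  per_class_error <- 1 - diag(cm) / rowSums(cm)
  # Return a named list so each piece can be retrieved later
  list(prediction = predicted, confusion = cm, ber = mean(per_class_error))
}

results <- lapply(3:1, ber_for)

# Pull the error rates out into a plain numeric vector
bers <- sapply(results, function(r) r$ber)
bers
```

Returning a named list makes results[[1]]$confusion easier to read than positional indexing such as results[[1]][[2]].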

Different results when performing PCA in R with princomp() and principal ()

I tried to use princomp() and principal() to do PCA in R with the data set USArrests. However, I got two different results for the loadings/rotation and scores.
First, I centered and normalised the original data frame so that it is easier to compare the outputs.
library(psych)
trans_func <- function(x){
x <- (x-mean(x))/sd(x)
return(x)
}
A <- USArrests
USArrests <- apply(USArrests, 2, trans_func)
princompPCA <- princomp(USArrests, cor = TRUE)
principalPCA <- principal(USArrests, nfactors=4 , scores=TRUE, rotate = "none",scale=TRUE)
Then I got the results for the loadings and scores using the following commands:
princompPCA$loadings
principalPCA$loadings
Could you please help me understand why there is a difference, and how we can interpret these results?
At the very end of the help document of ?principal:
"The eigen vectors are rescaled by the sqrt of the eigen values to produce the component loadings more typical in factor analysis."
So principal returns the scaled loadings. In fact, principal produces a factor model estimated by the principal component method.
Four years later, I would like to provide a more accurate answer to this question. I use the iris data as an example.
data = iris[, 1:4]
First, do PCA by the eigen-decomposition
eigen_res = eigen(cov(data))
l = eigen_res$values
q = eigen_res$vectors
Then the eigenvector corresponding to the largest eigenvalue gives the loadings of the first principal component
q[,1]
We can treat this as a reference or the correct answer. Now we check the results from different R functions.
First, by function 'princomp'
res1 = princomp(data)
res1$loadings[,1]
# compare with
q[,1]
No problem, this function actually just returns the same results as 'eigen' (possibly up to the sign of the vector). Now move to 'principal'
library(psych)
res2 = principal(data, nfactors=4, rotate="none")
# the loadings of the first PC is
res2$loadings[,1]
# compare it with the results by eigendecomposition
sqrt(l[1])*q[,1] # re-scale the eigen vector by sqrt of eigen value
You may find they are still different. The problem is that the 'principal' function does the eigendecomposition on the correlation matrix by default. Note: PCA is not invariant to rescaling the variables. If you modify the code as
res2 = principal(data, nfactors=4, rotate="none", cor="cov")
# the loadings of the first PC is
res2$loadings[,1]
# compare it with the results by eigendecomposition
sqrt(l[1])*q[,1] # re-scale the eigen vector by sqrt of eigen value
Now the loadings match the rescaled eigenvectors from 'eigen' and 'princomp'.
To summarize:
If you want to do PCA, you'd better use the 'princomp' function.
PCA is a special case, or a simplified version, of the factor model. It is just equivalent to an eigendecomposition.
We can apply PCA to get an approximation of a factor model. It doesn't care about the specific factors, i.e. the epsilons in a factor model. So, if you change the number of factors in your model, you will get the same estimates of the loadings. This is different from maximum likelihood estimation.
If you are estimating a factor model, you'd better use the 'principal' function, since it provides more features, like rotation, calculating the scores by different methods, and so on.
Rescaling the loadings of a PCA model doesn't affect the results much, since you still project the data onto the same optimal directions, i.e. those that maximize the variation in the resulting PCs.
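The two points about sign and scale can be checked numerically with base R alone. This is only a sketch on the iris data used above; the variable names are illustrative. It compares the princomp loadings with the eigenvectors while ignoring signs, and verifies that the eigenvectors rescaled by the square roots of the eigenvalues reproduce the covariance matrix.

```r
X <- iris[, 1:4]

e  <- eigen(cov(X))   # reference eigendecomposition
pc <- princomp(X)     # PCA on the covariance matrix

# Eigenvectors are only defined up to sign, so compare absolute values
raw_loadings  <- unclass(pc$loadings)
sign_free_gap <- max(abs(abs(raw_loadings) - abs(e$vectors)))

# "Factor-analysis style" loadings: eigenvectors times sqrt(eigenvalues)
scaled_loadings <- e$vectors %*% diag(sqrt(e$values))

# With all components kept, these loadings reproduce the covariance matrix
reconstruction_gap <- max(abs(scaled_loadings %*% t(scaled_loadings) - cov(X)))

sign_free_gap
reconstruction_gap
```

Both gaps should be numerically zero, which is why the unscaled and the rescaled loadings describe the same directions and differ only in length.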
ev <- eigen(R) # R is a correlation matrix of DATA
ev$vectors %*% diag(ev$values) %*% t(ev$vectors)
pc <- princomp(scale(DATA, center = F, scale = T),cor=TRUE)
p <-principal(DATA, rotate="none")
#eigen values
ev$values^0.5
pc$sdev
p$values^0.5
#eigen vectors - loadings
ev$vectors
pc$loadings
p$weights %*% diag(p$values^0.5)
pc$loadings %*% diag(pc$sdev)
p$loadings
#weights
ee <- diag(0,2)
for (j in 1:2) {
for (i in 1:2) {
ee[i,j] <- ev$vectors[i,j]/p$values[j]^0.5
}
};ee
#scores
s <- as.matrix(scale(DATA, center = T, scale = T)) %*% ev$vectors
scale(s)
p$scores
scale(pc$scores)

k-Fold Cross Validation in R - Negative Value Predictions

My predicted values are all negative. I would have expected 0's or 1's. Can anyone see where I am going wrong?
library(MASS) # provides the birthwt data
fold = 10
end = nrow(birthwt)
fold_2 = floor(end/fold)
df_i = birthwt[sample(nrow(birthwt)),] # random sort the dataframe birthwt
tester = df_i[1:fold_2,] # remove first tenth of rows - USE PREDICT ON THIS DATA
trainer = df_i[-c(1:fold_2),] # all other than the first tenth of rows - USE GLM ON THIS DATA
mod = glm(low~lwt,family=binomial,data=trainer)
ypred = predict(mod,data=tester) # predicted values
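The default for predict.glm is to give you the value of the link (on the scale of the linear predictors) before transformation, which for a binomial model is the log-odds and can be negative. If you want to predict the response (a probability between 0 and 1), use type="response". Also note that the argument for new data is newdata, not data; data=tester is silently ignored, so you get predictions for the training data instead:
ypred <- predict(mod, newdata=tester, type="response")
It may be helpful to read the ?predict.glm help file.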
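Putting the pieces together, a minimal sketch of one fold on the birthwt data (from the MASS package that ships with R) might look as follows. The 0.5 cut-off used to get 0/1 labels is an assumption for illustration, not something the model prescribes.

```r
library(MASS)  # provides the birthwt data

set.seed(1)
shuffled <- birthwt[sample(nrow(birthwt)), ]   # random row order
n_test   <- floor(nrow(shuffled) / 10)         # size of one fold

tester  <- shuffled[seq_len(n_test), ]         # held-out fold
trainer <- shuffled[-seq_len(n_test), ]        # remaining rows

mod <- glm(low ~ lwt, family = binomial, data = trainer)

# Link scale (log-odds): can be negative
link_pred <- predict(mod, newdata = tester)

# Response scale: probabilities between 0 and 1
prob_pred <- predict(mod, newdata = tester, type = "response")

# 0/1 labels, assuming a 0.5 cut-off
class_pred <- as.integer(prob_pred > 0.5)

head(cbind(link = link_pred, prob = prob_pred, class = class_pred))
```

The response-scale values are simply the inverse logit of the link-scale values, so plogis(link_pred) gives the same probabilities.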

What does monmlp.predict method return?

I use the ‘monmlp’ package in R as follows (monotone multi-layer perceptron neural network).
library(monmlp)
library(ROCR) # provides prediction() and performance()
model = monmlp.fit(trainData, trainLabs, hidden1=3, n.ensemble=1, bag=F,silent=T)
pred = monmlp.predict(testData,model)
preds = as.numeric(pred)
labs = as.numeric(testLabs)
pr = prediction(preds,labs)
pf = performance(pr,"auc")
pf@y.values[[1]]
I want to predict some new data using the trained model and keep the instances that score higher than a threshold value such as 0.9.
In brief, I want to keep the instances that are more likely to be in class 1, using a threshold.
The classes are 0 and 1, and
pred = monmlp.predict(testData,model)
head(pred)
returns
[,1]
311694 0.005271582
129347 0.005271582
15637 0.005271582
125458 0.005271582
315130 0.010411831
272375 0.010411831
What are these values? Probability values?
If so, what do these values mean?
pred[which(pred>1)]
[1] 1023.839 1023.839 1023.839
Thanks.
Regarding the output: "a matrix with number of rows equal to the number of samples and number of columns equal to the number of predictand variables. If weights is from an ensemble of models, the matrix is the ensemble mean and the attribute ensemble contains a list with predictions for each ensemble member."
Source:
http://cran.r-project.org/web/packages/monmlp/monmlp.pdf
I've never used the package or the technique, but maybe the quoted documentation will mean something to you.
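Since the returned object is a matrix with one row per sample, picking the instances above a cut-off is a matter of row indexing. Below is a small sketch with a made-up one-column matrix standing in for the monmlp.predict output (the values and row names are hypothetical). Values above 1, like the 1023.839 in the question, cannot be probabilities, so it may be worth checking the output range before treating the scores as such.

```r
# Hypothetical stand-in for the matrix returned by monmlp.predict()
pred <- matrix(
  c(0.005, 0.95, 0.40, 1023.839),
  ncol = 1,
  dimnames = list(c("311694", "129347", "15637", "125458"), NULL)
)

threshold <- 0.9

# Row names of the instances scoring above the threshold
selected <- rownames(pred)[pred[, 1] > threshold]

# Rows whose score falls outside [0, 1] and so is not a valid probability
out_of_range <- rownames(pred)[pred[, 1] < 0 | pred[, 1] > 1]

selected
out_of_range
```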
