Predict Logistf - r

I'm using a R package called logistf to make a Logistc Regression and I saw that there's no predict function for new data in this package and predict package does not work with this, so I found a code that show how making this with new data:
fit<-logistf(Tax ~ L20+L24+L28+L29+L31+L32+L33+L36+S10+S15+S16+S17+S20, data=trainData)
betas <- coef(fit)
X <- model.matrix(fit, data=testData)
probs <- 1 / (1 + exp(-X %*% betas))
I want to make a cross validation version with this using fit$predict and the probabilities that probs generate for me. Has anyone ever done something like this before?
Other thing that I want to know is about fit$predict I'm making a binary logistic regression, and this function returns many values, are these values from class 0 or 1, how can I know this? Thanks

While the code that you wrote works perfectly, there is a concise way of getting the same results seemingly:
brglm_model <- brglm(formula = response ~ predictor , family = "binomial", data = train )
brglm_pred <- predict(object = brglm_model, newdata = test , type = "response")
About the CV, you have to write a few lines of code I guess:
#Setting the number of folds, and number of instances in each fold
n_folds <- 5
fold_size <- nrow(dataset) %/% 5
residual <- nrow(dataset) %% 5
#label the instances based on the number of folds
cv_labels <- c(rep(1,fold_size),rep(2,fold_size), rep(3,fold_size), rep(4,fold_size), rep(5,fold_size), rep(5,residual))
# the error term would differ based on each threshold value
t_seq <- seq(0.1,0.9,by = 0.1)
index_mat <- matrix(ncol = (n_folds+1) , nrow = length(t_seq))
index_mat[,1] <- t_seq
# the main loop for calculation of the CV error on each fold
for (i in 1:5){
train <- dataset %>% filter(cv_labels != i)
test <- dataset %>% filter(cv_labels == i )
brglm_cv_model <- brglm(formula = response_var ~ . , family = "binomial", data = train )
brglm_cv_pred <- predict(object = brglm_model, newdata = test , type = "response")
# error formula that you want, e.g. misclassification
counter <- 0
for (treshold in t_seq ) {
counter <- counter + 1
conf_mat <- table( factor(test$response_var) , factor(brglm_cv_pred>treshold, levels = c("FALSE","TRUE") ))
sen <- conf_mat[2,2]/sum(conf_mat[2,])
# other indices can be computed as follows
#spec <- conf_mat[1,1]/sum(conf_mat[1,])
#prec <- conf_mat[2,2]/sum(conf_mat[,2])
#F1 <- (2*prec * sen)/(prec+sen)
#accuracy <- (conf_mat[1,1]+conf_mat[2,2])/sum(conf_mat)
#here I am only interested in sensitivity
index_mat[counter,(i+1)] <- sen
}
}
# final data.frame would be the mean of sensitivity over each threshold value
final_mat <- matrix(nrow = length(t_seq), ncol = 2 )
final_mat[,1] <- t_seq
final_mat[,2] <- apply(X = index_mat[,-1] , MARGIN = 1 , FUN = mean)
final_mat <- data.frame(final_mat)
colnames(final_mat) <- c("treshold","sensitivity")
#why not having a look at the CV-sensitivity of the model over threshold values?
ggplot(data = final_mat) +
geom_line(aes(x = treshold, y = sensitivity ), color = "blue")

Related

Misuse predict.rq in the package quantreg?

I am using quantreg package to predict new data based on training set. However, I noticed a discrepancy between predict.rq or predict and doing it manually. Here is an example:
The quantile regression setting is
N = 10000
tauList = seq(1:11/12)/12
y = rchisq(N,2)
X = matrix( rnorm(3*N) ,nrow = N, ncol = 3 )
fit <- rq( y ~ X-1, tau = tauList, method = "fn")
The new data set I want to predict is
newdata <- matrix( rbeta((3*N),2,2) ,nrow = N,ncol=3 )
I use predict.rq or predict to predict newdata. Both return the same result:
fit_use_predict <- predict.rq( fit, newdata = as.data.frame(newdata) )
Also I manually do the prediction based on the coefficients matrix:
coef_mat <- coef(fit)
fit_use_multiplication <- newdata %*% coef_mat
I expect both are numerically identical, but they are not:
diff <- fit_use_predict - fit_use_multiplication
print(diff)
Their difference cannot be negligible.
However, predicting the original data set X, both return the same result, i.e.,
predict(fit, newdata = data.frame(X)) = X %*% coef_mat ## True
Do I miss something when using the function? Thanks!
A more serious problem here, before we get to prediction is that the model is forcing all of the fitted quantile functions through the origin of design space and since the covariates are centered at the origin all of the quantile functions are forced to cross there. Even if the X's all lie in the positive orthant it is quite a strong assumption to say that the distribution of the response is degenerate at the origin.
I think you just have to retain the 'X' name in your data as it was in the training data.
library(quantreg)
N = 10000
tauList = seq(1:11/12)/12
y = rchisq(N,2)
X = matrix( rnorm(3*N) ,nrow = N, ncol = 3 )
fit <- rq( y ~ X-1, tau = tauList, method = "fn")
newdata <- matrix( rbeta((3*N),2,2) ,nrow = N,ncol=3 )
fit_use_predict <- predict.rq( fit, newdata = data.frame(X=I(newdata)) )
coef_mat <- coef(fit)
fit_use_multiplication <- newdata %*% coef_mat
diff <- fit_use_predict - fit_use_multiplication
max( abs(diff) )
Output is 0

Unexpected result from cross validation

I would like to perform 10-fold cross validation manually using prostate data to learn how to do it manually. I utilise the elasticnet package for code. I estimated the parameters by glmnet package (of course, it can perform cross validation too, but I would like to do that manually). After the analysis, It seems to me that I need a different criterion to choose tuning parameter other than minimum of cv.error because this gives the almost null model, if not so "where is my mistake?". (According to the original paper of Tibshirani, optimum model has three variables)
Here is the code
library(ElemStatLearn)
library(glmnet)
x <- scale(prostate[,1:8],T,T)
y <- scale(prostate[,9],T,F)
lambda = seq(0,1,0.02)
cv.folds <- function(n, folds = 10){
split(sample(1:n), rep(1:folds, length = n))
}
c.val <- function(x, y, K = 10, lambda, plot.it = TRUE){
n <- nrow(x)
all.folds <- cv.folds(length(y), K)
residmat <- matrix(0, length(lambda), K)
for(i in seq(K)) {
omit <- all.folds[[i]]
xk <- as.matrix(x[-omit, ])
yk <- as.vector(y[-omit])
xg <- x[omit, ]
yg <- y[omit]
fit <- glmnet(xk, yk, family="gaussian",
alpha=1, lambda=lambda,standardize = FALSE, intercept = FALSE)
fit <- predict(fit,newx=xg,lambda=lambda)
if(length(omit)==1){fit<-matrix(fit,nrow=1)}
residmat[, i] <- apply((yg - fit)^2, 2, mean)
}
cv <- apply(residmat, 1, mean)
cv.error <- sqrt(apply(residmat, 1, var)/K)
object<-list(lambda = lambda, cv = cv, cv.error = cv.error)
if(plot.it) {
plot(lambda, cv, type = "b", xlab="lambda", ylim = range(cv, cv + cv.error, cv - cv.error))
invisible(object)
}
}
result <- c.val(x,y,K = 10,lambda = lambda)
lambda.opt <- lambda[which.min(result$cv.error)]
fit <- glmnet(x, y, family="gaussian",
alpha=1, lambda=lambda.opt,standardize = FALSE, intercept = FALSE)
coef(fit)
Result:
> coef(fit)
9 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) .
lcavol 0.01926724
lweight .
age .
lbph .
svi .
lcp .
Edit:
Model generated directly from glmnet.
fit.lasso <- glmnet(x, y, family="gaussian", alpha=1,
standardize = FALSE, intercept = FALSE)
fit.lasso.cv <- cv.glmnet(x, y, type.measure="mse", alpha=1,
family="gaussian",standardize = FALSE, intercept = FALSE)
coef.lambda.min <- coef(fit.lasso.cv,s=fit.lasso.cv$lambda.min)
coef.lambda.1se <- coef(fit.lasso.cv,s=fit.lasso.cv$lambda.1se)
cbind(coef.lambda.min,coef.lambda.1se)
Result:
9 x 2 sparse Matrix of class "dgCMatrix"
1 1
(Intercept) . .
lcavol 0.59892674 0.5286355
lweight 0.23669159 0.1201279
age -0.06979581 .
lbph 0.09392021 .
svi 0.24620007 0.1400748
lcp . .
gleason 0.00346421 .
pgg45 0.06631013 .
The second column shows the correct (lambda.1se) result.
Your "mistake" is very hard to spot: it comes from the fact that glmnet will not use the order of your own lambda vector to sort the vector of results.
Example with the data you used:
res <- glmnet(x, y, lambda=lambda)
res$lambda
So when you call the command lambda[which.min(result$cv.error)] at the end of your procedure, you will not get the value corresponding to the minimum of the cross-validated error. Also, it explains why your graph looks strange.
An easy fix would be to declare lambda at the beginning of the script as a decreasing vector:
lambda = seq(1, 0, 0.02)
Final remark: be careful when using a single lambda.

R: simulating data with lmer and predict (or else)

I am fitting the following model
fit<- lmer(y ~ a + b + (1|c) + (1|a:d) , data=inputdata)
to real observations collected in "inputdata".
Now I want to generate various (1000) modelled datasets for a simulation based on the model parameters and the determined errors. I can use
pred <- predict(fit, newdata=list(a=val_a1, b=val_b1, c=val_c1, d = val_d1),
allow.new.levels = TRUE)
but this always provides the same (the most likely, mean value). Is there a way to get a distribution of values, meaning to draw from a predicted distribution?
As asked by #Adam Quek a reproducable example:
#creating dataset
a <- as.factor(sort(rep(1:4,5 )))
b <- rep(1:2,10)+0.5
c <- as.factor(c( sort(rep(1:2,5)),sort(rep(1:2,5)) ))
d <- as.factor(rep(1:5,4 ))
a <- c(a,a,a)
b <- c(b,b,b)
c <- c(c,c,c)
d <- c(d,d,d)
y <- rnorm(60)
inputdata = data.frame(y,a,b,c,d)
# fitting the model
fit<- lmer(y ~ a + b + (1|c) + (1|a:d) , data=inputdata)
# making specific predictions for a parameter set
val_a1 = 1
val_b1 = 2
val_c1 = 1
val_d1 = 4
pred <- predict(fit, newdata=list(a=val_a1, b=val_b1, c=val_c1, d = val_d1),
allow.new.levels = TRUE)
pred
what I obtain is:
0.2394255
If I do it again
pred <- predict(fit, newdata=list(a=val_a1, b=val_b1, c=val_c1, d = val_d1),
allow.new.levels = TRUE)
pred
I get of course:
0.2394255
but what I am searching for is a R function or routine that easily provides a suite of predictions that follow the distribution of my input values. Something like
for (i in 1:1000){
pred[i] <- predict(fit, newdata=list(a=val_a1, b=val_b1, c=val_c1, d =
val_d1),allow.new.levels = TRUE)
}
and mean(pred) = 0.2394255 but sd(pred) != 0
Thanks to #Alex W! bootMer does the job. Below for those who are interested the solution for the example:
m1 <- function(.) {
predict(., newdata=inputdata, re.form=NULL)
}
boot1 <- lme4::bootMer(fit, m1, nsim=1000, use.u=FALSE, type="parametric")
boot1$t[,1]
where boot1$t[,1]now contains the 1000 predictions when using the parameter values defined in inputdata[1,].
https://cran.r-project.org/web/packages/merTools/vignettes/Using_predictInterval.html
was a helpful link.

Prediction on Neural Network in R

I want to get the accuracy or the RMSE of the Prediction result of a neural network. I started using a Confusion Matrix, but as indicated by previous answers, the Confusion Matrix gives valid results for non Continuous variables.
Is there any way I can get the accuracy or the error rate of a Neural Network Prediction??
As an example here is the code I've got until now:
library(nnet)
library(caret)
library(e1071)
data(rock)
newformula <- perm ~ area + peri + shape
y <- rock[, "perm"]
x <- rock[!colnames(rock)%in% "perm"]
original <- datacol(rock,"perm")
nnclas_model <- nnet(newformula, data = rock, size = 4, decay = 0.0001, maxit = 500)
nnclas_prediction <- predict(nnclas_model, x)
nnclas_tab <- table(nnclas_prediction, y)
rmse <- sqrt(mean((original - nnclas_prediction)^2))
Does anyone know how can I make this work? or how can I get the Accuracy or the of the Neural Network Prediction?
Any help will be deeply appreciated.
I don't know about "nnet", but I have used the "neuralnet" library and am able to get the RMSE. Here is my full code: Just change the data for training_Data and testing_Data as per your requirements and in place of "Channel" give what is your classification attribute
dat <- read.csv("Give path of your data file here")
summary(dat)
cleandata <- dat
cleandata <- na.omit(cleandata)
#scaling
apply(cleandata,MARGIN = 2, FUN = function(x)sum(is.na(x)))
maxs = apply(cleandata, MARGIN = 2, max)
mins = apply(cleandata, MARGIN = 2, min)
scaledData = as.data.frame(scale(cleandata, center = mins, scale = maxs - mins))
summary(scaledData)
#Splitting data in 80:20 ratio
train = sample(1:nrow(scaledData), nrow(scaledData)*0.8)
test = -train
training_Data = scaledData[train,]
testing_Data = scaledData[test,]
dim(training_Data)
dim(testing_Data)
#neural net
library(neuralnet)
n <- names(training_Data)
f <- as.formula(paste("Channel ~", paste(n[!n %in% "Channel"], collapse = " + ")))
neuralnet_Model <- neuralnet(f,data = training_Data, hidden = c(2,1))
plot(neuralnet_Model)
neuralnet_Model$result.matrix
pred_neuralnet<-compute(neuralnet_Model,testing_Data[,2:8])
pred_neuralnet.scaled <- pred_neuralnet$net.result *(max(scaledData$Channel)-min(scaledData$Channel))+min(scaledData$Channel)
real.values <- (testing_Data$Channel)*(max(cleandata$Channel)-min(cleandata$Channel))+min(cleandata$Channel)
MSE.neuralnetModel <- sum((real.values - pred_neuralnet.scaled)^2)/nrow(testing_Data)
MSE.neuralnetModel
plot(real.values, pred_neuralnet.scaled, col='red',main='Real vs predicted',pch=18,cex=0.7)
abline(0,1,lwd=2)
legend('bottomright',legend='NN',pch=18,col='red', bty='n')
As mentioned in the comments, confusion matrices are for classification problems. If you meant to classify perm according to its levels, then the following code should work for you.
library(nnet)
library(caret)
library(e1071)
data(rock)
rock$perm <- as.factor(rock$perm)
nnclas_model <- nnet(perm ~ area + peri + shape, data = rock,
size = 4, decay = 0.0001, maxit = 500)
x <- rock[, 1:3]
y <- rock[, 4]
yhat <- predict(nnclas_model, x, type = 'class')
confusionMatrix(as.factor(yhat), y)
If you mean to treat perm as continuous, the confusion matrix doesn't make any sense. You should think in terms of mean-squared error instead.

binning continuous variables by IV value in R

I am building a logistic regression model in R. I want to bin continuous predictors in an optimal way in relationship to the target variable. There are two things that I know of:
the continuous variables are binned such that its IV (information value) is maximized
maximize the chi-square in the two way contingency table -- the target has two values 0 and 1, and the binned continuous variable has the binned buckets
Does anyone know of any functions in R that can perform such binning?
Your help will be greatly appreciated.
For the first point, you could bin using the weight of evidence (woe) with the package woebinning which optimizes the number of bins for the IV
library(woeBinning)
# get the bin cut points from your dataframe
cutpoints <- woe.binning(dataset, "target_name", "Variable_name")
woe.binning.plot(cutpoints)
# apply the cutpoints to your dataframe
dataset_woe <- woe.binning.deploy(dataset, cutpoint, add.woe.or.dum.var = "woe")
It returns your dataset with two extra columns
Variable_name.binned which is the labels
Variable_name.woe.binned which is the replaced values that you can then parse into your regression instead of Variable_name
For the second point, on chi2, the package discretization seems to handle it but I haven't tested it.
The methods used by regression splines to set knot locations might be considered. The rpart package probably has relevant code. You do need to penalize the inferential statistics because this results in an implicit hiding of the degrees of freedom expended in the process of moving the breaks around to get the best fit. Another common method is to specify breaks at equally spaced quantiles (quartiles or quintiles) within the subset with IV=1. Something like this untested code:
cont.var.vec <- # names of all your continuous variables
breaks <- function(var,n) quantiles( dfrm[[var]],
probs=seq(0,1,length.out=n),
na.rm=TRUE)
lapply(dfrm[ dfrm$IV == 1 , cont.var.vec] , breaks, n=5)
s
etwd("D:")
rm(list=ls())
options (scipen = 999)
read.csv("dummy_data.txt") -> dt
head(dt)
summary(dt)
mydata <- dt
head(mydata)
summary(mydata)
##Capping
for(i in 1:ncol(mydata)){
if(is.numeric(mydata[,i])){
val.quant <- unname(quantile(mydata[,i],probs = 0.75))
mydata[,i] = sapply(mydata[,i],function(x){if(x > (1.5*val.quant+1)){1.5*val.quant+1}else{x}})
}
}
library(randomForest)
x <- mydata[,!names(mydata) %in% c("Cust_Key","Y")]
y <- as.factor(mydata$Y)
set.seed(21)
fit <- randomForest(x,y,importance=T,ntree = 70)
mydata2 <- mydata[,!names(mydata) %in% c("Cust_Key")]
mydata2$Y <- as.factor(mydata2$Y)
fit$importance
####var reduction#####
vartoremove <- ncol(mydata2) - 20
library(rminer)
#####
for(i in 1:vartoremove){
rf <- fit(Y~.,data=mydata2,model = "randomForest", mtry = 10 ,ntree = 100)
varImportance <- Importance(rf,mydata2,method="sensg")
Z <- order(varImportance$imp,decreasing = FALSE)
IND <- Z[2]
var_to_remove <- names(mydata2[IND])
mydata2[IND] = NULL
print(i)
}
###########
library(smbinning)
as.data.frame(mydata2) -> inp
summary(inp)
attach(inp)
rm(result)
str(inp)
inp$target <- as.numeric(inp$Y) *1
table(inp$target)
ftable(inp$Y,inp$target)
inp$target <- inp$target -1
result= smbinning(df=inp, y="target", x="X37", p=0.0005)
result$ivtable
smbinning.plot(result,option="badrate",sub="test")
summary(inp)
result$ivtable
boxplot(inp$X2~inp$Y,horizontal=T, frame=F, col="red",main="Distribution")
###Sample
require(caTools)
inp$Y <- NULL
sample = sample.split(inp$target, SplitRatio = .7)
train = subset(inp, sample == TRUE)
test = subset(inp, sample == FALSE)
head(train)
nrow(train)
fit1 <- glm(train$target~.,data=train,family = binomial)
summary(rf)
prediction1 <- data.frame(actual = test$target, predicted = predict(fit1,test ,type="response") )
result= smbinning(df=prediction1, y="actual", x="predicted", p=0.005)
result$ivtable
smbinning.plot(result,option="badrate",sub="test")
tail(prediction1)
write.csv(prediction1 , "test_pred_logistic.csv")
predict_train <- data.frame(actual = train$target, predicted = predict(fit1,train ,type="response") )
write.csv(predict_train , "train_pred_logistic.csv")
result= smbinning(df=predict_train, y="actual", x="predicted", p=0.005)
result$ivtable
smbinning.plot(result,option="badrate",sub="train")
####random forest
rf <- fit(target~.,data=train,model = "randomForest", mtry = 10 ,ntree = 200)
prediction2 <- data.frame(actual = test$target, predicted = predict(rf,train))
result= smbinning(df=prediction2, y="actual", x="predicted", p=0.005)
result$ivtable
smbinning.plot(result,option="badrate",sub="train")
###########IV
library(devtools)
install_github("riv","tomasgreif")
library(woe)
##### K-fold Validation ########
library(caret)
cv_fold_count = 2
folds = createFolds(mydata2$Y,cv_fold_count,list=T);
smpl = folds[[i]];
g_train = mydata2[-smpl,!names(mydata2) %in% c("Y")];
g_test = mydata2[smpl,!names(mydata2) %in% c("Y")];
cost_train = mydata2[-smpl,"Y"];
cost_test = mydata2[smpl,"Y"];
rf <- randomForest(g_train,cost_train)
logit.data <- cbind(cost_train,g_train)
logit.fit <- glm(cost_train~.,data=logit.data,family = binomial)
prediction <- data.f
rame(actual = test$Y, predicted = predict(rf,test))

Resources