I thought that using bootstrap would resample my LDA but I am not sure. Additionally, if bootstrap does work, I am not sure how to code a bootstrap in r
Here is my LDA code:
library('MASS')
n=nrow(iris)
train = sample(n ,size = floor(n*0.75), replace = F)
train.species =Species[train]
test.species=Species[-train]
lda.fit = lda(Species~. , data=iris, subset=train)
The code below uses boot() library to perform bootstrap on iris dataset using LDA to get the standard errors for the LD1 and LD2 coefficients. Also, initial part of the code shows LDA fitting without bootstrap with the same coefficients.
# Library
library(MASS)
library(boot)
# Get data
data(iris)
names(iris) <- gsub("\\.", "", names(iris)) #remove dots from column names
# Split data into train and test sets
train_index <-sample(seq(nrow(iris)),floor(dim(iris)[1]*0.75))
train <- iris[train_index,]
test <- iris[-train_index,]
test_Y <- test[, c('Species')]
test_X <- subset(test, select=-c(Species))
#### LDA without bootstrap:
# Fit LDA to train data:
lda.fit = lda(Species ~ . , data=train)
lda.fit
# Predict test_Y based on lda.fit above
lda.pred <- predict(lda.fit, test_X)
lda.class <- lda.pred$class
# Confusion matrix
table(lda.class, test_Y)
#### LDA with bootstrap:
# Fit LDA to train data: to get standard errors for coefficients
set.seed(1)
boot.fn <- function(data,index){
return(coefficients(lda(Species ~ SepalLength + SepalWidth + PetalLength + PetalWidth, data=data, subset=index)))
}
# Call boot(): This returns LD1 and LD2 for each predictor
boot(train, boot.fn, 1000)
# NOTE: Here, in Bootstrap Statistics output, t1* to t4* are LD1 coefficients and t5* to t8* are LD2 coefficients
Related
set.seed(1234)
training.samples=RealEstate$Y.house.price.of.unit.area%>%createDataPartition(p=0.75,list=FALSE)
train.data=RealEstate[training.samples,]
test.data=RealEstate[-training.samples,]
Price.Model1=lm(Y.house.price.of.unit.area~factor(X1.transaction.date)+
X2.house.age+
X3.distance.to.the.nearest.MRT.station+
X4.number.of.convenience.stores+
X5.latitude+
X6.longitude,
data=train.data)
Would this be correct?
mean((test.data$Y.house.price.of.unit.area-predict(Price.Model1))^2)
I'm getting this warning so I'm not sure if I'm doing it right:
Warning in test.data$Y.house.price.of.unit.area - predict(Price.Model1) :
longer object length is not a multiple of shorter object length
Mean squared error is defined as:
To calculate it in R:
Fit the model with the training data
Use the test data to get predictions with the predict() function
Use the predicted and actual values of the test data to calculate MSE
Using some fake data...
test_ix <- floor(runif(nrow(mtcars) * 0.2, 1, nrow(mtcars)))
train <- mtcars[-test_ix, ]
X_test <- mtcars[test_ix, ] %>%
select(!mpg)
y_test <- mtcars[test_ix, "mpg"]
fit <- lm(mpg ~ ., data = train)
yhat <- predict(fit, X_test)
mse <- mean((y_test - yhat) ** 2)
To obtain the RMSE take the square root of the MSE.
Suppose I want to assess the goodness of a linear model before and after leaving out a covariate, and I want to implement some kind of bootstrapping.
I tried to bootstrap the sum of residuals of both models and then I applied the Kolmogorov-Smirnov test to assess if the two are the same distributions.
The minimal working code:
lm.statistic.resid <- function(data,i){
d<-data[i,]
r.gressor <- colnames(data)[1]
c.variates <- colnames(data)[-1]
lm.boot <- lm(data=d)
out <- sum(resid(lm.boot))
return(out)
}
df.restricted <- mtcars[ , names(mtcars) != c("wt")]
classical.lm <- lm(mtcars)
restricted.lm <- lm(df.restricted)
boot.regression.full = boot(df,
statistic=lm.statistic.resid,
R=1000)
boot.regression.restricted = boot(df.restricted,
statistic=lm.statistic.resid,
R=1000)
x <- boot.regression.restricted$t
y <- boot.regression.full$t
ks.test(x,y)
However, I get kind of the same result both in removing wt (which statistically significant) and am (which is not).
I should expect a smaller p-value in case I remove wt.
I am trying to plot a ROC curve with standard deviation using the the ROCR package.
I am using the quality.csv file for a reproducible example to be found here -- https://courses.edx.org/courses/course-v1:MITx+15.071x_3+1T2016/courseware/5893e4c5afb74898b8e7d9773e918208/030bf0a7275744f4a3f6f74b95169c04/
My code is the following:
data <- fread("quality.csv")
glimpse(data)
set.seed(88)
split <- sample.split(data$PoorCare, SplitRatio = 0.75)
data_train <- data[split, ]
data_test <- data[!split, ]
#--------------------------------------------------------------------------
# FITTING A MODEL
#--------------------------------------------------------------------------
model <- glm(PoorCare ~ OfficeVisits + Narcotics , data_train, family = "binomial")
#--------------------------------------------------------------------------
# MAKE PREDICTIONS ON THE TEST DATASET
#--------------------------------------------------------------------------
predict_Test <- predict(model, type = "response", newdata = data_test)
###########################################################################
# THE ROCR PACKAGE
###########################################################################
###########################################################################
# CREATE A PERFORMANCE OBJECT
###########################################################################
prediction_obj <- prediction(predict_Test, data_test$PoorCare)
#==========================================================================
# CALCULATE AUC
#==========================================================================
auc = as.numeric(performance(prediction_obj , "auc")#y.values)
# 0.7994792
#==========================================================================
# PLOT ROC CURVE WITH ERROR ESTIMATES
#==========================================================================
plot(perf, colorize=T, avg='threshold', spread.estimate='stddev', spread.scale = 2)
What I get is a ROC curve but without the standard errors:
Could you indicate what is wrong with my code and how to correct it?
Your advice will be appreciated.
The standard deviations and the CIs of the ROC curve can be plotted if a number of repeated (cross-validation or bootstrap) predictions has been performed.
Consider for example 100 repeated splits of data in training and testing sets with glm estimation and prediction:
library(dplyr)
library(data.table)
library(caTools)
library(ROCR)
data <- fread("quality.csv")
glimpse(data)
set.seed(1)
reps <- 100
predTests <- vector(mode="list", reps)
Labels <- vector(mode="list", reps)
for (k in 1:reps) {
splitk <- sample.split(data$PoorCare, SplitRatio = 0.75)
data_traink <- data[splitk, ]
data_testk <- data[!splitk, ]
model <- glm(PoorCare ~ OfficeVisits + Narcotics ,
data_traink, family = "binomial")
predTests[[k]] <- predict(model, type = "response", newdata = data_testk)
Labels[[k]] <- data_testk$PoorCare
}
Now calculate prediction and performance objects using the predTests and Labels lists:
predObjs <- prediction(predTests, Labels)
Perfs <- performance(predObjs , "tpr", "fpr")
and plot the set of ROC curves with mean values and confidence intervals:
plot(Perfs, col="grey82", lty=3)
plot(Perfs, lwd=3, avg="threshold", spread.estimate="stddev", add=TRUE, colorize=TRUE)
I am building a linear regression on the cafe dataset and I want to validate the results by calculationg Leave-One-Out CrossValidation.
I wrote my own function for that, which works if I fit lm() on all data, but when I am using subset of columns (from Stepwise regression), I am getting an error. Consider the following code:
cafe <- read.table("C:/.../cafedata.txt", header=T)
cafe$Date <- as.Date(cafe$Date, format="%d/%m/%Y")
#Delete row 34
cafe <- cafe[-c(34), ]
#wont need date
cafe <- cafe[,-1]
library(DAAG)
#center the data
cafe.c <- data.frame(lapply(cafe[,2:15], function(x) scale(x, center = FALSE, scale = max(x, na.rm = TRUE))))
cafe.c$Day.of.Week <- cafe$Day.of.Week
cafe.c$Sales <- cafe$Sales
#Leave-One-Out CrossValidation function
LOOCV <- function(fit, dataset){
# Attributes:
#------------------------------
# fit: Fit of the model
# dataset: Dataset to be used
# -----------------------------
# Returns mean of squared errors for each fold - MSE
MSEP_=c()
for (idx in 1:nrow(dataset)){
train <- dataset[-c(idx),]
test <- dataset[idx,]
MSEP_[idx]<-(predict(fit, newdata = test) - dataset[idx,]$Sales)^2
}
return(mean(MSEP_))
}
Then when I fit the simple linear model and call the function, it works:
#----------------Simple Linear regression with all attributes-----------------
fit.all.c <- lm(cafe.c$Sales ~., data=cafe.c)
#MSE:258
LOOCV(fit.all.c, cafe.c)
However when I fit the same lm() only with subset of columns, I get an error:
#-------------------------Linear Stepwise regression--------------------------
step <- stepAIC(fit.all.c, direction="both")
fit.step <- lm(cafe.c$Sales ~ cafe.c$Bread.Sand.Sold + cafe.c$Bread.Sand.Waste
+ cafe.c$Wraps.Waste + cafe.c$Muffins.Sold
+ cafe.c$Muffins.Waste + cafe.c$Fruit.Cup.Sold
+ cafe.c$Chips + cafe.c$Sodas + cafe.c$Coffees
+ cafe.c$Day.of.Week,data=cafe.c)
LOOCV(fit.step, cafe.c)
5495.069
There were 50 or more warnings (use warnings() to see the first 50)
If I look closer:
idx <- 1
train <- cafe.c[-c(idx)]
test <- cafe.c[idx]
(predict(fit.step, newdata = test) -cafe.c[idx]$Sales)^2
I get MSE for all rows and an error:
Warning message:
'newdata' had 1 row but variables found have 47 rows
EDIT
I have found this question about the error, which says that this error occurs when I give different names to the columns, however this is not the case.
Change your code like the following:
fit.step <- lm(Sales ~ Bread.Sand.Sold + Bread.Sand.Waste
+ Wraps.Waste + Muffins.Sold
+ Muffins.Waste + Fruit.Cup.Sold
+ Chips + Sodas + Coffees
+ Day.of.Week,data=cafe.c)
LOOCV(fit.step, cafe.c)
# [1] 278.8984
idx <- 1
train <- cafe.c[-c(idx),]
test <- cafe.c[idx,] # need to select the row, not the column
(predict(fit.step, newdata = test) -cafe.c[idx,]$Sales)^2
# 1
# 51.8022
Also, you LOOCV implementation is not correct. You must fit a new model everytime on the train dataset on the leave-1-out fold. Right now you are training the model once on the entire dataset and using the same single model to compute the MSE on held out test dataset for each leave-1-out fold, but ideally you should have different models trained on different training datasets.
I am trying to bootstrap a simple multinomial regression in R, and I am getting an error:
Error in is.data.frame(data) : object 'd' not found
What is really strange is that I am using the same code (adjusted to this particular problem) as in a tutorial for boot package at Quick-R, and that same code also worked when I am using different function (like lm()). For sure, I am doing something stupid, but I do not see what. Please, if anyone can help, I would appreciate a lot.
This is an example:
require(foreign)
require(nnet)
require(boot)
# an example for multinomial logistic regression
ml = read.dta('http://www.ats.ucla.edu/stat/data/hsbdemo.dta')
ml = ml[,c(5,7,3)]
bs <- function(formula, data, indices) {
d = data[indices,] # allows boot to select sample
fit = multinom(formula, data=d)
s = summary(fit)
return(list(fit$coefficients, fit$standard.errors))
}
# 5 replications
results = list()
results <- boot(
data=ml, statistic=bs, R=5, parallel='multicore',
formula=prog~write
)
The error happens in the summary() part, also the object returned by multinom() does not have coefficients and standard.errors. It seems, that summary.multinom() in turn calculates the hessian from your data, d, which for some reason (probably a scoping issue) cannot be found. A quick fix is to add Hess = TRUE:
bs <- function(formula, data, indices) {
d = data[indices,] # allows boot to select sample
fit = multinom(formula, data=d, Hess = TRUE)
s = summary(fit)
return( cbind(s$coefficients, s$standard.errors) )
}
# 5 replications
results = list()
results <- boot(
data=ml, statistic=bs, R=5, parallel='multicore',
formula=prog~write
)
Multinomial logistic regression returns a matrix of coefficients using the coef() function. This differs from a lm or glm model which returns a vector of coefficients.
library(foreign) # read.dta()
library(nnet) # multinom()
require(boot) # boot()
# an example for multinomial logistic regression
ml = read.dta('http://www.ats.ucla.edu/stat/data/hsbdemo.dta')
ml = ml[,c(5,7,3)]
names(ml)
bs <- function(formula, data, indices) {
d = data[indices,] # allows boot to select sample
fit = multinom(formula, data=d, maxit=1000, trace=FALSE)
#s = summary(fit)
#return(list(fit$coefficients, fit$standard.errors))
estimates <- coef(fit)
return(t(estimates))
}
# enable parallel
library(parallel)
cl <- makeCluster(2)
clusterExport(cl, "multinom")
# 10000 replications
set.seed(1984)
results <- boot(
data=ml, statistic=bs, R=10000, parallel = "snow", ncpus=2, cl=cl,
formula=prog~write
)
# label the estimates
subModelNames <- colnames(results$t0)
varNames <- rownames(results$t0)
results$t0
estNames <- apply(expand.grid(varNames,subModelNames),1,function(x) paste(x,collapse="_"))
estNames
colnames(results$t) <- estNames
# summary of results
library(car)
summary(results)
confint(results, level=0.95, type="norm")
confint(results, level=0.95, type="perc")
confint(results, level=0.95, type="bca")
# plot the results
hist(results, legend="separate")