Problems with cross-validation code in R

I'm writing a function to perform logistic regression on two columns of a dataframe, using 10-fold cross-validation, but I can't get past the errors. Here's the code I'm using:
SAdata = read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data",
                    sep = ",", head = T, row.names = 1)

log.fun = function(x, y) {
  prediction = data.frame()
  tset = data.frame()
  dframe = cbind(x, y)
  dframe = as.data.frame(dframe)
  dframe$fold = sample(1:10, nrow(data), replace = TRUE)
  list = 1:10
  for (i in 1:10) {
    train = subset(dframe, fold %in% list[-i])
    test = subset(dframe, fold %in% c(i))
    model = glm(x ~ y, data = train, family = binomial)
    pred = as.data.frame(predict(model, test[,-1]))
    prediction <- rbind(prediction, pred)
  }
}

log.fun(SAdata$chd, SAdata$obesity)
The error I get is "Error in sample.int(length(x), size, replace, prob) :
invalid 'size' argument"
Any ideas?

This is a somewhat sub-optimal use of for loops, and especially of modelling; if you want to try developing some good models, try the package 'caret'.
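For example, here is a minimal caret sketch of the same 10-fold cross-validated logistic regression (assuming the SAdata frame loaded above; converting chd to a factor is my addition, since caret treats a factor response as classification):

library(caret)

# chd is coded 0/1; caret expects a factor response for classification
SAdata$chd <- factor(SAdata$chd, labels = c("no", "yes"))

# trainControl() sets up the 10 folds; train() fits glm on each fold
cv.model <- train(chd ~ obesity, data = SAdata,
                  method = "glm", family = binomial,
                  trControl = trainControl(method = "cv", number = 10))
cv.model  # accuracy averaged over the 10 held-out folds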
If you still want to use that function, here is a workaround:
SAdata = read.table("http://statweb.stanford.edu/~tibs/ElemStatLearn/datasets/SAheart.data",
                    sep = ",", head = T, row.names = 1)

log.fun = function(x, y) {
  prediction = data.frame()
  tset = data.frame()
  dframe = cbind(x, y)
  dframe = as.data.frame(dframe)
  dframe$fold = sample(1:10, nrow(dframe), replace = TRUE)
  list = 1:10
  results <- list()
  for (i in 1:10) {
    results[[paste0('Fold', i)]]$train <- subset(dframe, fold %in% list[-i])
    results[[paste0('Fold', i)]]$test  <- subset(dframe, fold %in% c(i))
    results[[paste0('Fold', i)]]$model <- glm(x ~ y, data = results[[i]]$train, family = binomial)
    results[[paste0('Fold', i)]]$pred  <- as.data.frame(predict(results[[i]]$model, results[[i]]$test[,-1]))
    results[[paste0('Fold', i)]]$prediction <- rbind(prediction, results[[i]]$pred)
  }
  results
}

your_results <- log.fun(SAdata$chd, SAdata$obesity)
head(your_results$Fold1$prediction)
In fact you had a problem in the call to 'sample': you were passing 'data', and that object did not exist. I replaced it with 'dframe' and added names to each part of your results.
Hope it helps

Related

Generic Function for K-Fold Cross-Validation In R for Linear Models

Hi guys, I need help troubleshooting the function below. I am using R.
The dataset I am using is called wages and comes from the package ISLR (library(ISLR); data(wages)).
Anyhow, I am trying to develop a function that performs k-fold cross-validation on any general linear model.
The inputs/arguments to the function are function(numberOfFolds, y, x, inputData):
y is the dependent variable
x is all the other variables in the dataset
inputData is the wages dataset
numberOfFolds is k, basically.
I have developed the code below, but I am getting NaN values. Not sure what is going wrong! Could someone please help?
my.k.fold.1 <- function(numberOfFolds, y, x, inputData){
  index <- sample(1:numberOfFolds, nrow(inputData), replace = T)
  inputData$index <- index
  mse <- vector('numeric', length = numberOfFolds)
  for (n in 1:numberOfFolds) {
    data.train <- inputData[index != n, ]
    data.test  <- inputData[index == n, ]
    my.equation <- paste(y, paste(x, collapse = '+'), sep = '~')
    formula.1 <- formula(my.equation)
    model.test <- lm(formula.1, data = data.train)
    predictions <- predict(model.test, newdata = data.test)
    mse[[n]] <- mean((data.test$y - predictions)^2)
  }
  return(mse)
}
my.k.fold.1(numberOfFolds = 5, y = 'earn', x = c('race', 'sex', 'ed', 'height', 'age'), inputData = wages)
I would like to keep the arguments the same, so that I can write down the column names in y and x.
This is because the y variable is a string, so data.test$y is equivalent to data.test[["y"]]. You should replace it with data.test[[y]], which is equivalent to data.test$earn if y="earn":
my.k.fold.1 <- function(numberOfFolds, y, x, inputData){
  index <- sample(1:numberOfFolds, nrow(inputData), replace = T)
  inputData$index <- index
  mse <- vector('numeric', length = numberOfFolds)
  for (n in 1:numberOfFolds) {
    data.train <- inputData[index != n, ]
    data.test  <- inputData[index == n, ]
    my.equation <- paste(y, paste(x, collapse = '+'), sep = '~')
    formula.1 <- formula(my.equation)
    model.test <- lm(formula.1, data = data.train)
    predictions <- predict(model.test, newdata = data.test)
    mse[[n]] <- mean((data.test[[y]] - predictions)^2)
  }
  return(mse)
}
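A quick way to see the difference, using the built-in mtcars data as a stand-in:

y <- "mpg"
mtcars$y           # NULL: `$` looks for a column literally named "y"
head(mtcars[[y]])  # the mpg column: `[[` first evaluates y to "mpg"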
Here is a general-purpose function. The argument names are self-descriptive. I have added an argument verbose, defaulting to FALSE.
Tested below with built-in data set mtcars.
my.k.fold.1 <- function(numberOfFolds, inputData, response, regressors, verbose = FALSE){
  fmla <- paste(regressors, collapse = "+")
  fmla <- paste(response, fmla, sep = "~")
  fmla <- as.formula(fmla)
  index <- sample(numberOfFolds, nrow(inputData), replace = TRUE)
  mse.all <- numeric(numberOfFolds)
  for (n in seq_len(numberOfFolds)) {
    inx <- which(index != n)
    data.training <- inputData[inx, ]
    data.test <- inputData[-inx, ]
    if (verbose) {
      msg <- paste("fold:", n, "nrow(training):", nrow(data.training), "nrow(test):", nrow(data.test))
      message(msg)
    }
    model <- lm(fmla, data = data.training)
    predicted <- predict(model, newdata = data.test)
    mse <- mean((data.test[[response]] - predicted)^2)
    mse.all[n] <- mse
  }
  return(mse.all)
}
X <- names(mtcars)[-c(1, 3, 5, 7)]
y <- "mpg"
set.seed(2021)
mse.kcv <- my.k.fold.1(5, mtcars, response = y, regressors = X, verbose = TRUE)
mse.kcv
#[1] 14.255583 8.355831 2.765447 7.539299 10.151655

R programming - not deleting the right column

I am pasting my code here.
I am following an online course in R and I was trying to automate a multiple-variable regression. I checked what is going on: at the beginning it works, but when it comes to the last two variables, the function enters a loop and does not eliminate them, even though it enters the if.
At the end, I get this error:
Error in if (maxVar > sl) { : missing value where TRUE/FALSE needed
Here is the code
backwardElimination <- function(training, sl) {
  numVar = length(training)
  funzRegressor = lm(formula = profit ~ ., data = training)
  p = summary(funzRegressor)$coefficients[, 4]
  maxVar = max(p)
  if (maxVar > sl){
    for (j in c(1:numVar)){
      if (maxVar == p[j]) {
        training = training[, -j]
        backwardElimination(training, sl)
      }
    }
  }
  return(summary(funzRegressor))
}
Thanks in advance
Edit: this is the rest of my code
# importing dataset
dataset = read.csv('50_Startups.csv')

# Encoding categorical data
dataset$State = factor(dataset$State,
                       levels = c('New York', 'California', 'Florida'),
                       labels = c(1, 2, 3))

# splitting in train / test set
library(caTools)
set.seed(123)
split = sample.split(dataset$Profit, SplitRatio = 4/5)
trainingSet = subset(dataset, split == TRUE)
testSet = subset(dataset, split == FALSE)

# Transforming State into dummy variables
trainingSet$State = factor(trainingSet$State)
dummies = model.matrix(~trainingSet$State)
trainingSet = cbind(trainingSet, dummies)
profit = trainingSet$Profit
trainingSet = trainingSet[, -4]
trainingSet = trainingSet[, -4]
trainingSet = cbind(trainingSet, profit)

# calling the function
SL = 0.05
backwardElimination(trainingSet, SL)
This error indicates that you have an NA instead of a boolean value in your if statement:
if (NA) {}
## Error in if (NA) { : missing value where TRUE/FALSE needed
Either your p contains NA, or sl is NA.
The (Intercept) column produced by model.matrix() is also fed back into the next step of modelling; you need to get rid of it before moving to the next iteration.
I can replicate your error with the R built-in dataset state.x77:
dataset <- as.data.frame(state.x77)
dataset$State <- rownames(dataset)
dataset$profit <- rnorm(nrow(dataset))

backwardElimination <- function(training, sl) {
  if (!"profit" %in% names(training)) return(NULL)
  numVar = length(training)
  funzRegressor = lm(formula = profit ~ ., data = training)
  p = summary(funzRegressor)$coefficients[, 4]
  maxVar = max(p)
  # print(funzRegressor)
  if (maxVar > sl){
    for (j in c(1:numVar)){
      if (maxVar == p[j]) {
        training = training[, -j]
        backwardElimination(training, sl)
      }
    }
  }
  return(summary(funzRegressor))
}

backwardElimination(dataset, 0.05)
There are NAs in some of your betas, and all the p-values become NaN. Do you need to regress within states? If not, you can remove the State column to get rid of the error.
There will be another error when you reach the boundary case in your recursion, which you can fix :) (one possible fix is sketched below).
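For reference, here is a minimal sketch of an iterative rewrite that addresses both issues: it ignores the intercept's p-value and drops the offending column by name rather than by index. It assumes the response is named profit and that every predictor is a single numeric column (a factor expands into several coefficients and would need extra handling); backwardElimination2 is a hypothetical name:

backwardElimination2 <- function(training, sl) {
  repeat {
    fit <- lm(profit ~ ., data = training)
    p <- summary(fit)$coefficients[-1, 4]            # p-values without the intercept row
    if (length(p) == 0 || max(p, na.rm = TRUE) <= sl) break
    worst <- names(which.max(p))                     # predictor with the largest p-value
    training <- training[, setdiff(names(training), worst), drop = FALSE]
  }
  summary(fit)
}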

Error in bayesm rhierNegbinRw function

I am attempting to fit a hierarchical negative binomial model with bayesm. Though my data is proprietary, I was able to recreate the same error with the margarine dataset. The error I get is as follows:
> look <- rhierNegbinRw(Data = list(regdata = dat1), Mcmc = list(R = 1000,
nprint = 100))
Z not specified - using a column of ones instead
Error in alpha <= 0 :
comparison (4) is possible only for atomic and list types
I set up the mock data as follows (the regression is completely nonsensical -- just trying to get the thing to work):
data(margarine)
chpr <- margarine$choicePrice
chpr$hhid <- as.factor(chpr$hhid)
N <- nlevels(chpr$hhid)
dat1 <- vector(mode = "list", length = N)
for (i in 1:N) {
  dat1[[i]]$y <- chpr[chpr$hhid == levels(chpr$hhid)[i], "PPk_Stk"]
  dat1[[i]]$X <- model.matrix( ~ choice + PBB_Stk,
                               data = chpr[chpr$hhid == levels(chpr$hhid)[i], ])
}
I would greatly appreciate any insight into this issue.

Regsubsets results differ from coef() for models with linear dependencies

While using regsubsets from the package leaps on data with linear dependencies, I found that the results given by coef() and by summary()$which differ. It seems that, when linear dependencies are found, reordering changes the position of coefficients and coef() returns wrong values.
I use mtcars just to simulate the problem I had with other data. In the first example there is no issue of linear dependencies, the best model by BIC is mpg~wt+cyl, and both coef() and summary()$which give the same result. In the second example I add a dummy variable, so perfect multicollinearity becomes possible, but with the variables in this order (dummy in the last column) the problem does not appear. In the last example, after changing the order of the variables in the dataset, the problem finally appears and coef() and summary()$which give different models. Is there anything incorrect in this approach? Is there any other way to get coefficients from regsubsets?
require("leaps") #install.packages("leaps")
###Example1
dta <- mtcars[,c("mpg","cyl","am","wt","hp") ]
bestSubset.cars <- regsubsets(mpg~., data=dta)
(best.sum <- summary(bestSubset.cars))
#
w <- which.min(best.sum$bic)
best.sum$which[w,]
#
best.sum$outmat
coef(bestSubset.cars, w)
#
###Example2
dta2 <- cbind(dta, manual=as.numeric(!dta$am))
bestSubset.cars2 <- regsubsets(mpg~., data=dta)
(best.sum2 <- summary(bestSubset.cars2))
#
w <- which.min(best.sum2$bic)
best.sum2$which[w,]
#
coef(bestSubset.cars2, w)
#
###Example3
bestSubset.cars3 <- regsubsets(mpg~., data=dta2[,c("mpg","manual","am","cyl","wt","hp")])
(best.sum3 <- summary(bestSubset.cars3))
#
w <- which.min(best.sum3$bic)
best.sum3$which[w,]
#
coef(bestSubset.cars3, w)
#
best.sum2$which
coef(bestSubset.cars2,1:4)
best.sum3$which
coef(bestSubset.cars3,1:4)
The variable orders used by summary.regsubsets and regsubsets are different. The generic coef() function for regsubsets combines the two, and the results get mixed up if you use force.in or a formula with a fixed variable order. Changing some lines in the coef() function might help. Try the code below and see if it works!
coef.regsubsets <- function (object, id, vcov = FALSE, ...)
{
  s <- summary(object)
  invars <- s$which[id, , drop = FALSE]
  betas <- vector("list", length(id))
  for (i in 1:length(id)) {
    # added
    var.name <- names(which(invars[i, ]))
    thismodel <- which(object$xnames %in% var.name)
    names(thismodel) <- var.name
    # deleted
    # thismodel <- which(invars[i, ])
    qr <- .Fortran("REORDR", np = as.integer(object$np),
                   nrbar = as.integer(object$nrbar), vorder = as.integer(object$vorder),
                   d = as.double(object$d), rbar = as.double(object$rbar),
                   thetab = as.double(object$thetab), rss = as.double(object$rss),
                   tol = as.double(object$tol), list = as.integer(thismodel),
                   n = as.integer(length(thismodel)), pos1 = 1L, ier = integer(1))
    beta <- .Fortran("REGCF", np = as.integer(qr$np), nrbar = as.integer(qr$nrbar),
                     d = as.double(qr$d), rbar = as.double(qr$rbar), thetab = as.double(qr$thetab),
                     tol = as.double(qr$tol), beta = numeric(length(thismodel)),
                     nreq = as.integer(length(thismodel)), ier = numeric(1))$beta
    names(beta) <- object$xnames[qr$vorder[1:qr$n]]
    reorder <- order(qr$vorder[1:qr$n])
    beta <- beta[reorder]
    if (vcov) {
      p <- length(thismodel)
      R <- diag(qr$np)
      R[row(R) > col(R)] <- qr$rbar
      R <- t(R)
      R <- sqrt(qr$d) * R
      R <- R[1:p, 1:p, drop = FALSE]
      R <- chol2inv(R)
      dimnames(R) <- list(object$xnames[qr$vorder[1:p]],
                          object$xnames[qr$vorder[1:p]])
      V <- R * s$rss[id[i]] / (object$nn - p)
      V <- V[reorder, reorder]
      attr(beta, "vcov") <- V
    }
    betas[[i]] <- beta
  }
  if (length(id) == 1)
    beta
  else betas
}
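After sourcing this patched method (it masks the one in leaps because it lives in the global environment), the Example 3 call should line up with summary()$which again, e.g. coef(bestSubset.cars3, which.min(summary(bestSubset.cars3)$bic)).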
Another solution that works for me is to randomize the order of the columns (independent variables) in your dataset before running regsubsets. The idea is that after reordering, the highly correlated columns will hopefully be far apart from each other and will not trigger the reorder behaviour in the regsubsets algorithm. A sketch follows.
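A minimal sketch of that shuffle, assuming the dta2 frame from the question (the response column is kept first):

set.seed(1)  # make the shuffle reproducible
shuffled <- dta2[, c("mpg", sample(setdiff(names(dta2), "mpg")))]
bestSubset.shuffled <- regsubsets(mpg ~ ., data = shuffled)
coef(bestSubset.shuffled, which.min(summary(bestSubset.shuffled)$bic))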

How to extract the p.value and estimate from cor.test() in a data.frame?

In this example I have temperature values from 50 different sites, and I would like to correlate Site 1 with all 50 sites. But I want to extract only the components "p.value" and "estimate" generated by cor.test(), into two different columns of a data.frame.
I made an attempt and it works, but I don't know how!
For that reason I would like to know how I can simplify my code, because the problem is that I have to run a for loop twice to get my results.
Here is my example:
# Temperature data
data <- matrix(rnorm(500, 10:30, sd = 5), nrow = 100, ncol = 50, byrow = TRUE,
               dimnames = list(c(paste("Year", 1:100)),
                               c(paste("Site", 1:50))))

# Empty data.frame
df <- data.frame(label = paste("Site", 1:50), Estimate = "", P.value = "")

# Extraction
for (i in 1:50) {
  df1 <- cor.test(data[, 1], data[, i])
  df[, 2:3] <- df1[c("estimate", "p.value")]
}

for (i in 1:50) {
  df1 <- cor.test(data[, 1], data[, i])
  df[i, 2:3] <- df1[c("estimate", "p.value")]
}
df
I will appreciate very much your help :)
I might offer up the following as well (hiding the loop inside lapply(); note it starts at column 2, skipping Site 1's trivial correlation with itself):
result <- do.call(rbind, lapply(2:50, function(x) {
  cor.result <- cor.test(data[, 1], data[, x])
  pvalue <- cor.result$p.value
  estimate <- cor.result$estimate
  return(data.frame(pvalue = pvalue, estimate = estimate))
}))
First of all, I'm guessing you had a typo in your code (you should have rnorm(5000) if you want unique values; otherwise you're going to cycle through those 500 numbers 10 times).
Anyway, a simple way of doing this would be:
data <- matrix(rnorm(5000, 10:30, sd = 5), nrow = 100, ncol = 50, byrow = TRUE,
               dimnames = list(c(paste("Year", 1:100)),
                               c(paste("Site", 1:50))))

# Empty data.frame
df <- data.frame(label = paste("Site", 1:50), Estimate = "", P.value = "")

estimates = numeric(50)
pvalues = numeric(50)
for (i in 1:50){
  test <- cor.test(data[, 1], data[, i])
  estimates[i] = test$estimate
  pvalues[i] = test$p.value
}
df$Estimate <- estimates
df$P.value <- pvalues
df
Edit: I believe your issue is that in the line df <- data.frame(label=paste("Site", 1:50), Estimate="", P.value="") the columns are initialized with empty strings, so typeof(df$Estimate) shows they are not numeric, while typeof(test$estimate) shows it spits out a double, so R doesn't know what you're trying to do with those two values. You can redo your code like this:
df <- data.frame(label = paste("Site", 1:50), Estimate = numeric(50), P.value = numeric(50))
for (i in 1:50){
  test <- cor.test(data[, 1], data[, i])
  df$Estimate[i] = test$estimate
  df$P.value[i] = test$p.value
}
to make it a little more concise.
Similar to the answer of colemand77:
Create a cor function:
cor_fun <- function(x, y, method){
  tmp <- cor.test(x, y, method = method)
  cbind(r = tmp$estimate, p = tmp$p.value)
}
Then apply it over the columns of the data.frame; transposing the result gives p and r by row:
t(apply(data, 2, cor_fun, data[, 1], "spearman"))
