predict in caret confusionMatrix is removing rows

I'm fairly new to the caret library and it's causing me some problems. Any
help or advice would be appreciated. My situation is as follows:
I'm trying to fit a generalized linear model (glm) to some data and, when I pass the
predictions to confusionMatrix, I get 'the data and reference factors must have
the same number of levels'. I know what this error means (I've run into it before), but I've double- and triple-checked my data manipulation and it all looks correct (I'm using the right variables in the right places), so I'm not sure why the two arguments to confusionMatrix disagree. I've run almost exactly the same code for a different variable and it works fine.
I went through every variable and the row counts all matched until I got to the
predict() call inside confusionMatrix. I discovered this by doing the following:
a <- table(testing2$hold1yes0no)
a[1]+a[2]
1543
b <- table(predict(modelFit,trainTR2))
dim(b)
[1] 1538
Those two values shouldn't disagree. Where are the missing 5 rows?
My code is below:
set.seed(2382)
inTrain2 <- createDataPartition(y=HOLD$hold1yes0no, p = 0.6, list = FALSE)
training2 <- HOLD[inTrain2,]
testing2 <- HOLD[-inTrain2,]
preProc2 <- preProcess(training2[-c(1,2,3,4,5,6,7,8,9)], method="BoxCox")
trainPC2 <- predict(preProc2, training2[-c(1,2,3,4,5,6,7,8,9)])
trainTR2 <- predict(preProc2, testing2[-c(1,2,3,4,5,6,7,8,9)])
modelFit <- train(training2$hold1yes0no ~ ., method ="glm", data = trainPC2)
confusionMatrix(testing2$hold1yes0no, predict(modelFit,trainTR2))

I'm not sure as I don't know your data structure, but I wonder if this is due to the way you set up your modelFit, using the formula method. In this case, you are specifying y = training2$hold1yes0no and x = everything else. Perhaps you should try:
modelFit <- train(trainPC2, training2$hold1yes0no, method="glm")
This specifies y = training2$hold1yes0no and x = trainPC2.
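One further thing that may be worth checking, offered as an assumption rather than a diagnosis since the data are not shown: rows with missing predictor values can be dropped at prediction time, which would leave fewer predictions than reference labels. The sketch below uses base R's glm() and made-up data (train_df and test_x are hypothetical stand-ins, not the HOLD data) to show how the na.action argument of predict() changes the number of predictions returned, and how complete.cases() counts the affected rows.

```r
set.seed(1)
# Hypothetical training data (stand-in for trainPC2 plus the outcome)
train_df <- data.frame(y  = rbinom(50, 1, 0.5),
                       x1 = rnorm(50),
                       x2 = rnorm(50))
fit <- glm(y ~ x1 + x2, family = binomial, data = train_df)

# Hypothetical test predictors (stand-in for trainTR2) with two incomplete rows
test_x <- data.frame(x1 = c(1.2, NA, 0.1, -0.4, 0.5),
                     x2 = c(0.3, 0.8, NA, 0.1, -0.9))

# How many rows contain at least one missing value?
n_incomplete <- sum(!complete.cases(test_x))

# na.pass keeps one prediction per row (NA where inputs are missing);
# na.omit drops the incomplete rows instead, so the result is shorter
p_pass <- predict(fit, newdata = test_x, type = "response", na.action = na.pass)
p_omit <- predict(fit, newdata = test_x, type = "response", na.action = na.omit)

c(rows = nrow(test_x), incomplete = n_incomplete,
  pass = length(p_pass), omit = length(p_omit))
```

If sum(!complete.cases(trainTR2)) turned out to be 5 for the data in the question, that would account for the gap between 1543 and 1538.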

What does "invalid type (closure) for variable 'variable1'" mean and how do I fix it?

I am trying to write a function in R that calls a function from another package. The code works perfectly outside a function.
I am guessing it might have something to do with the package I am using (survey).
A self-contained code example:
#activating the packages
library(survey)
library(foreign) # provides read.spss()
#getting the dataset into R
tm <- read.spss("tm.sav", to.data.frame = TRUE, max.value.labels = 5)
# creating svydesign object (it basically contains the weights to adjust the variables (~persgew: also a column variable contained in the tm-dataset))
tm_w <- svydesign(ids=~0, weights = ~persgew, data = tm)
#getting overview of the welle-variable
#this variable is part of the tm-dataset. it is needed to execute the following steps
table(tm$welle)
# data manipulation as in: taking the v12d_gr-variable as well as the welle-variable and the svydesign-object to create a longitudinal variable which is transformed into a data frame that can be passed to ggplot
t <- svytable(~v12d_gr+welle, tm_w)
tt <- round(prop.table(t,2)*100, digits=0)
v12d <- tt[2,]
v12d <- as.data.frame(v12d)
This is the code outside the function, and it works perfectly. Since I have to transform quite a few variables in exactly the same way, I want to write a function to save some time.
The following function is supposed to take a variable that will be transformed as an argument (v12sd2_gr).
#making sure the survey-object is loaded
tm_w <- svydesign(ids=~0, weights = ~persgew, data = tm)
#trying to write a function containing the code from above
ltd_zsw <- function(variable1){
  t <- svytable(~variable1+welle, tm_w)
  tt <- round(prop.table(t,2)*100, digits=0)
  var_ltd_zsw <- tt[2,]
  var_ltd_zsw <- as.data.frame(var_ltd_zsw)
  return(var_ltd_zsw)
}
Calling the function:
#as v12d has been altered already, I am trying to transform another variable v12sd2_gr
v12sd2 <- ltd_zsw(v12sd2_gr)
Console output:
Error in model.frame.default(formula = weights ~ variable1 + welle, data = model.frame(design)) :
invalid type (closure) for variable 'variable1'
Called from: model.frame.default(formula = weights ~ variable1 + welle, data = model.frame(design))
How do I fix it? And what do "dynamically building a formula" and "reformulating" mean?
PS: I hope this is the appropriate way to respond to the feedback in the comments.
Update: I think I was able to trace the problem back to the argument I am passing (variable1), and I am guessing it has something to do with the fact that I use a formula inside the function. But when I try to call svytable with as.formula(svytable(~variable1+welle, tm_w)) it still doesn't work.
What to do?
I have found a solution to the problem.
Here is the tested and working function:
ltd_test <- function(var, x, string1="con", string2="pro") {
  print(table(var))
  x$w12d_gr <- ifelse(as.numeric(var)>2,1,0)
  x$w12d_gr <- factor(x$w12d_gr, levels = c(0,1), labels = c(string1,string2))
  print(table(x$w12d_gr))
  x_w <- svydesign(ids=~0, weights = ~persgew, data = x)
  t <- svytable(~w12d_gr+welle, x_w)
  tt <- round(prop.table(t,2)*100, digits=0)
  w12d <- tt[2,]
  w12d <- as.data.frame(w12d)
}
The problem appeared to be caused by svydesign(). It produces an object that is then used by the formula passed to svytable(). That's why it is imperative to first create the x_w object with svydesign() inside the function and only then call svytable() to create the t object.
In the code snippet I originally posted in the question, the tm_w object was created and stored globally.
Thanks to everyone for the help. I hope this will be of use to someone one day!
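On the question of what building a formula dynamically means: inside a function, ~variable1+welle refers to a column literally called variable1, not to whatever was passed in. A common alternative is to pass the column name as a string and assemble the formula with base R's reformulate(). The sketch below is a minimal illustration under stated assumptions: toy and ltd_sketch are hypothetical names, and base R's xtabs() is swapped in for svytable() so the example runs without the survey package or the tm data.

```r
# Hypothetical toy data standing in for the tm data set
toy <- data.frame(
  v12sd2_gr = factor(c("con", "pro", "pro", "con", "pro", "pro")),
  welle     = c(1, 1, 1, 2, 2, 2)
)

# Hypothetical helper: takes the column NAME as a string
ltd_sketch <- function(varname, data) {
  f <- reformulate(c(varname, "welle"))  # builds ~v12sd2_gr + welle
  t <- xtabs(f, data = data)             # stand-in for svytable(f, design)
  tt <- round(prop.table(t, 2) * 100, digits = 0)
  out <- tt[2, ]                         # second row: share of "pro" per wave
  as.data.frame(out)
}

f_example <- reformulate(c("v12sd2_gr", "welle"))
f_example

res <- ltd_sketch("v12sd2_gr", toy)
res
```

With a survey design object, the same formula f could in principle be handed to svytable() in place of the hard-coded one; that substitution is not tested here.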

How to input matrix data into brms formula?

I am trying to input matrix data into the brm() function to run a signal regression. brm is from the brms package, which provides an interface to fit Bayesian models using Stan. Signal regression is when you model one covariate using another within the bigger model, and you use the by parameter like this: model <- brm(response ~ s(matrix1, by = matrix2) + ..., data = Data). The problem is, I cannot input my matrices using the 'data' parameter because it only allows one data.frame object to be inputted.
Here are my code and the errors I obtained from trying to get around that constraint...
First off, my reproducible code leading up to the model-building:
library(brms)
#100 rows, 4 columns. Each cell contains a number between 1 and 10
Data <- data.frame(runif(100,1,10),runif(100,1,10),runif(100,1,10),runif(100,1,10))
#Assign names to the columns
names(Data) <- c("d0_10","d0_100","d0_1000","d0_10000")
Data$Density <- as.matrix(Data)%*%c(-1,10,5,1)
#the coefficients we are modelling
d <- c(-1,10,5,1)
#Made a matrix with 4 columns with values 10, 100, 1000, 10000 which are evaluation points. Rows are repeats of the same column numbers
Bins <- 10^matrix(rep(1:4,times = dim(Data)[1]),ncol = 4,byrow =T)
Bins
As mentioned above, since 'data' only allows one data.frame object to be inputted, I've tried other ways of inputting my matrix data. These methods include:
1) making the matrix within the brm() function using as.matrix()
signalregression.brms <- brm(Density ~ s(Bins,by=as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])),data = Data)
#Error in is(sexpr, "try-error") :
#  argument "sexpr" is missing, with no default
2) making the matrix outside the formula, storing it in a variable, then calling that variable inside the brm() function
Donuts <- as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])
signalregression.brms <- brm(Density ~ s(Bins,by=Donuts),data = Data)
#Error: The following variables can neither be found in 'data' nor in 'data2':
#  'Bins', 'Donuts'
3) inputting a list containing the matrix using the 'data2' parameter
signalregression.brms <- brm(Density ~ s(Bins,by=donuts),data = Data,data2=list(Bins = 10^matrix(rep(1:4,times = dim(Data)[1]),ncol = 4,byrow =T),donuts=as.matrix(Data[,c(c("d0_10","d0_100","d0_1000","d0_10000"))])))
#Error in names(dat) <- object$term :
#  'names' attribute [1] must be the same length as the vector [0]
None of the above worked; each had their own errors and it was difficult troubleshooting them because I couldn't find answers or examples online that were of a similar nature in the context of brms.
I was able to use the above techniques just fine for gam(), in the mgcv package - you don't have to define a data.frame using 'data', you can call on variables defined outside of the gam() formula, and you can make matrices inside the gam() function itself. See below:
library(mgcv)
signalregression2 <- gam(Data$Density ~ s(Bins,by = as.matrix(Data[,c("d0_10","d0_100","d0_1000","d0_10000")]),k=3))
#Works!
It seems like brms is less flexible... :(
My question: does anyone have any suggestions on how to make my brm() function run?
Thank you very much!
My understanding of signal regression is limited enough that I'm not convinced this is correct, but I think it's at least a step in the right direction. The problem seems to be that brm() expects everything in its formula to be a column in data. So we can get the model to compile by ensuring all the things we want are present in data:
library(tidyverse)
signalregression.brms = brm(Density ~
                              s(cbind(d0_10_bin, d0_100_bin, d0_1000_bin, d0_10000_bin),
                                by = cbind(d0_10, d0_100, d0_1000, d0_10000),
                                k = 3),
                            data = Data %>%
                              mutate(d0_10_bin = 10,
                                     d0_100_bin = 100,
                                     d0_1000_bin = 1000,
                                     d0_10000_bin = 10000))
Writing out each column by hand is a little annoying; I'm sure there are more general solutions.
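One way to avoid typing each column by hand, sketched in base R only: add the constant bin columns in a single assignment and paste the formula together from the column names. The names signal_cols, bin_cols, bin_values, rhs and f are introduced here for illustration, and the brm() call is left commented out, so this shows only the data and formula construction, not a tested brms fit.

```r
set.seed(42)
signal_cols <- c("d0_10", "d0_100", "d0_1000", "d0_10000")

# Same shape as the question's data: 100 rows, 4 columns of values in [1, 10]
Data <- as.data.frame(matrix(runif(400, 1, 10), ncol = 4,
                             dimnames = list(NULL, signal_cols)))
Data$Density <- drop(as.matrix(Data[signal_cols]) %*% c(-1, 10, 5, 1))

# One constant column per evaluation point, recycled down all rows
bin_values <- c(10, 100, 1000, 10000)
bin_cols   <- paste0(signal_cols, "_bin")
Data[bin_cols] <- as.list(bin_values)

# Assemble the same formula as above from the column names
rhs <- sprintf("s(cbind(%s), by = cbind(%s), k = 3)",
               paste(bin_cols, collapse = ", "),
               paste(signal_cols, collapse = ", "))
f <- as.formula(paste("Density ~", rhs))
f

# fit <- brms::brm(f, data = Data)  # not run in this sketch
```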
For reference, here are my installed package versions:
map_chr(unname(unlist(pacman::p_depends(brms)[c("Depends", "Imports")])), ~ paste(., ": ", pacman::p_version(.), sep = ""))
[1] "Rcpp: 1.0.6" "methods: 4.0.3" "rstan: 2.21.2" "ggplot2: 3.3.3"
[5] "loo: 2.4.1" "Matrix: 1.2.18" "mgcv: 1.8.33" "rstantools: 2.1.1"
[9] "bayesplot: 1.8.0" "shinystan: 2.5.0" "projpred: 2.0.2" "bridgesampling: 1.1.2"
[13] "glue: 1.4.2" "future: 1.21.0" "matrixStats: 0.58.0" "nleqslv: 3.3.2"
[17] "nlme: 3.1.149" "coda: 0.19.4" "abind: 1.4.5" "stats: 4.0.3"
[21] "utils: 4.0.3" "parallel: 4.0.3" "grDevices: 4.0.3" "backports: 1.2.1"

How to get the prediction output from glmmPQL to work with performance using R?

Problem
I am using R 3.3.3 on Windows 10 (64-bit). I fit a model with glmmPQL as follows:
library(MASS)
library(nlme)
library(dplyr)
model<-glmmPQL(a ~ b + c + d, data = trainingDataSet, family = binomial, random = list( ~ 1 | e), correlation = corAR1())
The prediction values are given as follows:
p <- predict(model, newdata=testingDataSet, type="response",level=0) # (1.0)
The output it gives is as follows:
I then try to measure the performance of this output using the following code:
library(ROCR) # assumption: prediction() here is ROCR::prediction()
pr <- prediction(p, testingDataSet$a) # (1.1)
It gives the following error:
Error in prediction(p, testingDataSet$a) :
Format of predictions is invalid. (1.2)
I have successfully used the prediction function with the output of other model functions (glm, svm, nn), where the code looks something like this:
model<-glm(a ~ b + c + e, family = binomial(link = 'logit'), data = trainingDataSet)
p <- predict(model, newdata=testingDataSet, type="response") # (1.3)
Attempts
I believe the fix is to get the output of 1.0 into the format produced by 1.3 above. I have tried the following in R without success.
I have tried casting p from 1.0 using as.numeric(), as.list() and other things. I want it to look like the p object in 1.3. In other words, I believe the format is the reason why things are not working for me.
No matter what mutate or cast I try, I can't seem to get it into the form in 1.3 (the desired form shown in the image), especially with the index shown as column headers.
I'm coming up empty-handed on Stack Overflow and in the R help files. When I call class(p) on both objects, R tells me they are numeric.
Question
Given the above, can someone tell me how to get the output from glmmPQL into a format that the prediction function can use, as shown above?
In other words, how can I make the output in 1.0 match the output in 1.3? My attempts have failed and I would deeply appreciate it if someone more skilled in R could point out where I am going wrong.
If you use as.numeric(p) then you'll get the values you want - then the only difference is that the GLM output has names. You can add these in with something like:
p <- as.numeric(p)
names(p) <- 1:length(p)
If this doesn't work, you can use str(p) to examine the structure of the object in more depth.
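As a small illustration of why class(p) can report numeric for both objects while they still differ, here is a base R sketch. It assumes, without having the original output to confirm it, that the glmmPQL prediction carries an extra attribute; p_raw is a hypothetical stand-in built by hand, not real model output.

```r
# Hypothetical stand-in for the glmmPQL prediction: a numeric vector that
# carries an extra attribute (this is the assumption being illustrated)
p_raw <- structure(c(0.12, 0.87, 0.45), label = "Predicted values")

class(p_raw)       # "numeric", so class() alone does not reveal the attribute
str(p_raw)         # str() shows the attribute alongside the numbers
attributes(p_raw)

p <- as.numeric(p_raw)     # as.numeric() strips all attributes
names(p) <- seq_along(p)   # optional: add names, as in the glm prediction
str(p)
```

Comparing str() or attributes() on the two prediction objects from 1.0 and 1.3 is a quick way to see what actually differs between them.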

bic.glm predict error: "newdata is missing variables"

I've spent a lot of time trying to solve this error and searching for solutions without any luck, and I thank you in advance for your help.
I'm trying to create predicted values from the coefficients created via BMA. Whenever I run my predict function, I am getting a "newdata is missing variables" error. All variables included in the original model are present in the new dataframe, so I'm not quite sure what the problem is. I'm working with a fairly large dataset with many independent variables. I'm fairly new to R, so I apologize if this is an obvious question!
y<-df$y
x<-df
x$y<-NULL
bic.glm<-bic.glm(x, y, strict=FALSE, OR=20, glm.family="binomial", factortype=TRUE)
predict(bic.glm.bwt, x)
I've also tried it this way:
bic.glm<-bic.glm(y~., data=df, strict=FALSE, OR=20, glm.family="binomial", factortype=TRUE)
predict(bic.glm, x)
And also with creating a new data frame...
bic.glm<-bic.glm(y~., data=df, strict=FALSE, OR=20, glm.family="binomial", factortype=TRUE)
newdata<-x
predict(bic.glm, newdata=x)
Each time I receive the same error message:
Error in predict.bic.glm(bic.glm, newdata=x) :
newdata is missing variables
Any help is very much appreciated!
First, it is bad practice to call your LHS the same name as the function call. You may be masking the function bic.glm from further use.
That minor comment aside... I just encountered the same error. After some digging, it seems that predict.bic.glm checks the names vs. the mle matrix in the bic.glm object. The problem is that somewhere in bic.glm, if factors are used, those names get a '.x' or just '.' appended at the end. Therefore, whenever you use factors you will get this error.
I communicated this to package maintainers. Meanwhile, you can work around the bug by renaming the column names of the mle object, like this (using your example):
fittedBMA<-bic.glm(y~., data=df)
colnames(fittedBMA$mle)=colnames(model.matrix(y~., data=df)) ### this is the workaround
predict(fittedBMA,newdata=x) ### should work now, if x has the same variables as df
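To see which names the workaround copies over, it can help to look at what model.matrix() produces when a factor is present. This is a minimal base R sketch with made-up data (toy_df is hypothetical, not the data from the question); it illustrates only the column naming, not the internals of bic.glm.

```r
# Hypothetical data with one numeric and one factor predictor
toy_df <- data.frame(
  y    = c(0, 1, 0, 1, 1, 0),
  age  = c(23, 35, 31, 40, 28, 37),
  race = factor(c("a", "b", "c", "a", "b", "c"))
)

# With the default treatment contrasts, a factor expands into one dummy
# column per non-reference level, named <variable><level>
mm_names <- colnames(model.matrix(y ~ ., data = toy_df))
mm_names

# The raw variable names, for comparison
names(toy_df)[-1]
```

The design-matrix names differ from the raw variable names as soon as a factor is involved, which is the kind of name comparison the answer above describes going wrong.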
Okay, so first look at the usage section in the CRAN documentation for BMA::bic.glm.
This example is instructive for a data.frame.
Example 2 (binomial)
library(MASS)
data(birthwt)
y <- birthwt$low
x <- data.frame(birthwt[,-1])
x$race <- as.factor(x$race)
x$ht <- (x$ht>=1)+0
x <- x[,-9]
x$smoke <- as.factor(x$smoke)
x$ptl <- as.factor(x$ptl)
x$ht <- as.factor(x$ht)
x$ui <- as.factor(x$ui)
bic.glm.bwT <- bic.glm(x, y, strict = FALSE, OR = 20,
                       glm.family="binomial",
                       factor.type=TRUE)
predict( bic.glm.bwT, newdata = x)
bic.glm.bwF <- bic.glm(x, y, strict = FALSE, OR = 20,
                       glm.family="binomial",
                       factor.type=FALSE)
predict( bic.glm.bwF, newdata = x)

Removing character level outlier in R

I have a linear model1<-lm(divorce_rate~marriage_rate+median_age+population) for which the leverage plot shows an outlier at 28 (State variable id for "Nevada"). I'd like to specify a model without Nevada in the dataset. I tried the following but got stuck.
data<-read.dta("census.dta")
attach(data)
data1<-data.frame(pop,divorce,marriage,popurban,medage,divrate,marrate)
attach(data1)
model1<-lm(divrate~marrate+medage+pop,data=data1)
summary(model1)
layout(matrix(1:4,2,2))
plot(model1)
dfbetaPlots(lm(divrate~marrate+medage+pop),id.n=50)
vif(model1)
dataNV<-data[!data$state == "Nevada",]
attach(dataNV)
model3<-lm(divrate~marrate+medage+pop,data=dataNV)
The last line of the above code gives me
Error in model.frame.default(formula = divrate ~ marrate + medage + pop, :
variable lengths differ (found for 'medage')
I suspect that you have some glitch in your code such that you have attach()ed copies that are still lying around in your environment -- that's why it's really best practice not to use attach(). The following code works for me:
library(foreign)
## best not to call data 'data'
mydata <- read.dta("http://www.stata-press.com/data/r8/census.dta")
I didn't find divrate or marrate in the data set: I'm going to speculate that you want the per capita rates:
## best practice to use a new name rather than transforming 'in place'
mydata2 <- transform(mydata,marrate=marriage/pop,divrate=divorce/pop)
model1 <- lm(divrate~marrate+medage+pop,data=mydata2)
library(car)
plot(model1)
dfbetaPlots(model1)
This works fine for me in a clean session:
dataNV <- subset(mydata2,state != "Nevada")
## update() may be nice to avoid repeating details of the
## model specification (not really necessary in this case)
model3 <- update(model1,data=dataNV)
Or you can use the subset argument:
model4 <- update(model1,subset=(state != "Nevada"))
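For anyone curious how a stale attach()ed copy leads to "variable lengths differ", here is a small base R sketch that reproduces the error without attach(). The data frame full and its values are made up for illustration, and the free-standing medage vector plays the role of a full-length copy left on the search path.

```r
# Made-up data: five "states", one of which is Nevada
full <- data.frame(
  state   = c("Alabama", "Nevada", "Ohio", "Texas", "Utah"),
  divrate = c(0.006, 0.017, 0.005, 0.007, 0.005),
  marrate = c(0.012, 0.142, 0.009, 0.013, 0.012),
  medage  = c(29.3, 30.2, 29.9, 28.0, 24.2)
)

# A full-length copy outside any data frame, as attach() would expose
medage <- full$medage

# A 4-row subset that happens to lack the medage column
sub <- subset(full, state != "Nevada", select = -medage)

# lm() finds divrate and marrate in sub (4 values each) but falls back to
# the 5-value medage from the surrounding environment
msg <- tryCatch({
  lm(divrate ~ marrate + medage, data = sub)
  "no error"
}, error = function(e) conditionMessage(e))
msg

# When every variable lives in the data frame passed to lm(), the fit works
model_ok <- lm(divrate ~ marrate + medage,
               data = subset(full, state != "Nevada"))
nobs(model_ok)
```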
