How to update a glm model that contains NAs after fitting? Error: number of observations not equal - r

I have a dataset that contains some missing values (on the independent variables). I'm fitting a glm model:
f.model <- glm(y ~ x1 + x2, data = data, family = binomial, na.action = na.omit)
After this model I want the 'null' model, so I used update:
n.model <- update(f.model, . ~ 1)
This seems to work, but the number of observations in the two models differs (f.model n = 234; n.model n = 235). So when I try to compute a likelihood ratio I get an error: Number of observation not equal!!.
Q: How to update the model so that it accounts for the missing values?

Although it is a bit strange that na.action = na.omit did not solve the NA problem, I decided to filter the data myself:
library(epicalc)                  # for lrtest
vars <- c("y", "x1", "x2")        # variables in the model
n.data <- na.omit(data[, vars])   # keep only complete cases on the model variables
f.model <- glm(y ~ x1 + x2, data = n.data, family = binomial)
n.model <- update(f.model, . ~ 1)
LR <- lrtest(n.model, f.model)
If someone has a better solution, or an explanation of why na.action in combination with update results in unequal numbers of observations, your answer is more than welcome!
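The mismatch happens because na.omit only drops rows with NAs in the variables the formula actually uses: the full model needs y, x1 and x2, while the null model y ~ 1 needs only y, so a row missing just x1 or x2 survives the null fit (hence 235 vs. 234). A minimal sketch of a fix that keeps update(), refitting the null model on the rows the full model actually used (base-R anova() here stands in for epicalc's lrtest):
f.model <- glm(y ~ x1 + x2, data = data, family = binomial, na.action = na.omit)
# model.frame(f.model) holds exactly the complete cases the full model used
n.model <- update(f.model, . ~ 1, data = model.frame(f.model))
nobs(f.model) == nobs(n.model)         # now TRUE
anova(n.model, f.model, test = "LRT")  # base-R likelihood ratio test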

Related

Marginal Effect from svyglm object with a subsample in R

I need to compute marginal effects out of a Generalized Linear Model (family=Poisson) estimated via the svyglm function from the R package survey for a subsample.
First, I declared the survey design with:
myDesign <- svydesign(id = data$id, strata = data$strata, weights = data$sw, data = data)
Second, I estimated my model as:
fit <- svyglm(y ~ x1 + x2, design = myDesign, data = data, subset = x3 == 1, family = poisson(link = "log"))
Finally, when I want to get the Average Marginal Effect for, let's say, x1 I run:
summary(margins(fit, variables = "x1", design = myDesign))
... but I get the following error message:
"Error in h(simpleError(msg, call)) :
error in evaluating the argument 'object' in selecting a method for function 'summary': 'x' and 'w' must have the same length"
Running the following does not work either:
summary(margins(fit, variables = "x1", design = myDesign, subset = x3 == 1))
Solution:
summary(margins(fit, variables = "x1", design = myDesign[myDesign$variables$x3 == 1, ]))
Subsetting complex surveys leads to problems in the error estimation. When interested in a parameter for a specific subsample, one should use the desired subsample to estimate the parameter of interest, and the full sample for the estimation of its error.
For example, svyglm(y ~ x, data = data, subset = z == 1) does exactly this (beta_hat is estimated using the observations for which z = 1, and se(beta_hat) using the full sample).
Subsetting a svydesign object is possible, and it keeps the original design information about the number of clusters and strata. The code shown above is the "manual" way of doing so. Alternatively, one can rely directly on the subset.survey.design function from the survey package:
myDesign_subset <- subset(myDesign, x3 == 1)
The two methods are equivalent and produce correct z-stats.
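A self-contained sketch of the two equivalent routes, using the survey package's built-in apistrat example data (the variables awards, api00 and ell come from that dataset, not from the question):
library(survey)
data(api)
des <- svydesign(ids = ~1, strata = ~stype, weights = ~pw, data = apistrat)
# Method 1: "manual" logical subset of the design object
des_sub1 <- des[des$variables$awards == "Yes", ]
# Method 2: subset.survey.design
des_sub2 <- subset(des, awards == "Yes")
fit1 <- svyglm(api00 ~ ell, design = des_sub1)
fit2 <- svyglm(api00 ~ ell, design = des_sub2)
all.equal(coef(fit1), coef(fit2))  # both routes give the same fit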

R code: Error in model.matrix.default(mt, mf, contrasts) : Variable 1 has no levels

I am trying to build a logistic regression model with diagnosis as the response (a 2-level factor: B, M).
I am getting an Error on building a logistic regression model:
Error in model.matrix.default(mt, mf, contrasts) :
variable 1 has no levels
I am not able to figure out how to solve this issue.
R Code:
Cancer <- read.csv("Breast_Cancer.csv")
## Logistic Regression Model
lm.fit <- glm(diagnosis~.-id-X, data = Cancer, family = binomial)
summary(lm.fit)
Dataset Reference: https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
Your problem is similar to the one reported here on the randomForest classifier.
Apparently glm checks through the variables in your data and throws an error because X contains only NA values.
You can fix that error
either by dropping X completely from your dataset, setting Cancer$X <- NULL before handing it to glm and leaving X out of your formula (glm(diagnosis ~ . - id, data = Cancer, family = binomial));
or by adding na.action = na.pass to the glm call (which essentially instructs it to ignore the NA warning) while still excluding X in the formula itself (glm(diagnosis ~ . - id - X, data = Cancer, family = binomial, na.action = na.pass)).
However, please note that you would still have to make sure to provide the diagnosis variable in a form digestible by glm, meaning either a numeric vector with values 0 and 1, a logical vector, or a factor:
"For binomial and quasibinomial families the response can also be specified as a factor (when the first level denotes failure and all others success)" - from the glm-doc
Just define Cancer$diagnosis <- as.factor(Cancer$diagnosis).
On my end this still leaves some warnings, but I think those come from the data or your feature selection. It clears the blocking errors :)
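Putting the first option together as a sketch (the id and X column names assume the Kaggle CSV from the question):
Cancer <- read.csv("Breast_Cancer.csv")
Cancer$X <- NULL                                 # drop the all-NA column
Cancer$diagnosis <- as.factor(Cancer$diagnosis)  # B = failure, M = success
lm.fit <- glm(diagnosis ~ . - id, data = Cancer, family = binomial)
summary(lm.fit)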

Attempt to apply non- function, randomForest

I have a data set (train2) with 79 variables (numeric and text combined) and SalePrice as the last column. I am trying to create a randomForest model, and this is the error I get:
Forest <- randomForest(SalePrice~., data = train2, na.action = TRUE)
Error in model.frame.default(formula = SalePrice ~ ., data = train2, na.action = TRUE) :
attempt to apply non-function
Do you have any idea how I can solve this error?
@joran is correct. I also want to steer you in the direction of exploring these two arguments:
ntree: Number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times.
mtry: Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p), where p is the number of variables in x) and regression (p/3).
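The immediate error, though, comes from na.action = TRUE: na.action must be a function (model.frame tries to call it, hence "attempt to apply non-function"). A minimal sketch of a corrected call, assuming train2 is as in the question and its text columns are factors:
library(randomForest)
Forest <- randomForest(SalePrice ~ ., data = train2,
                       na.action = na.roughfix,  # impute NAs (median/mode) instead of erroring
                       ntree = 500)              # the default; raise for more stable predictions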

RRF model is giving NA for test set

I am working on a regression problem.
I am using RRF in R to implement it.
I made two different data sets, one for training and the other for testing.
library(RRF)
train <- read.csv('training_data.csv', header = FALSE)
model <- RRF(as.numeric(V128) ~ ., data = train, flagReg = 1, importance = TRUE, ntree = 1000, keep.forest = TRUE, type = regression, na.action = na.roughfix)
print(model)
Call:
RRF(formula = as.numeric(V128) ~ ., data = train, flagReg = 1, importance = TRUE, ntree = 1000, keep.forest = TRUE, type = regression, na.action = na.roughfix)
Type of random forest: regression
Number of trees: 1000
No. of variables tried at each split: 2656
Mean of squared residuals: 0.03509357
% Var explained: 81.5
Now when I use this model to predict on the test set:
test <- read.csv('testing_data.csv', header = FALSE)
predict(model, test, type = "response")
it gives NA for the entire test set. When I try it on the train set it still gives the same, which I did not expect at all.
When I run
predict(model, new_data = test, type = "response")
or
predict(model, new_data = train, type = "response")
the out-of-bag prediction stored in the object is returned, which implies the data were not passed.
What should I do to get the predictions? After that I also want to measure the accuracy or performance of the predictions.
I'm currently going through the same problem. This answer has already helped me identify the cause of the NA predictions. Short(er) answer: there are NAs in your features (predictors).
I'm also looking for a solution for the rest of the modeling process. When I have enough information to address the rest of your questions, I'll come back and update this answer.
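A sketch of a possible way forward, assuming the NAs in the features are indeed the cause: impute the test set the same way na.action = na.roughfix treated the training set, and note that the argument name is newdata, not new_data (an unmatched new_data is silently absorbed by ..., which is why predict falls back to the out-of-bag predictions):
library(RRF)
test_imputed <- na.roughfix(test)                # median/mode imputation, as in training
preds <- predict(model, newdata = test_imputed)  # newdata, not new_data
# e.g. test MSE, assuming the test file also has the V128 response column:
mean((preds - as.numeric(test_imputed$V128))^2)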

Why doesn't predict like the dimensions of my newdata?

I want to perform a multiple regression in R and make predictions based on the trained model. Below is an example of the code I am using:
price = c(10,18,18,11,17)
predictors = cbind(c(5,6,3,4,5),c(2,1,8,5,6))
predict(lm(price ~ predictors), data.frame(predictors=matrix(c(3,5),nrow=1)))
So, based on the 2-variable regression model trained on 5 samples, I want to make a prediction for the test data point where the first variable is 3 and the second is 5. But I get a warning from the above code saying that 'newdata' had 1 rows but variable(s) found have 5 rows. How can I correct the above code? The code below works fine, where I give the variables to the model formula separately. But since I will have hundreds of variables, I have to give them as a matrix, because it would be unfeasible to append hundreds of columns using the + sign.
price = c(10,18,18,11,17)
predictor1 = c(5,6,3,4,5)
predictor2 = c(2,1,8,5,6)
predict(lm(price ~ predictor1 + predictor2), data.frame(predictor1=3,predictor2=5))
Thanks in advance!
The easiest way to get past the issue of matching up variable names from a matrix of covariates to the newdata data.frame's column names is to put your input data into a data.frame as well. Try this:
price = c(10,18,18,11,17)
predictors = cbind(c(5,6,3,4,5),c(2,1,8,5,6))
indata<-data.frame(price,predictors=predictors)
predict(lm(price ~ ., indata), data.frame(predictors=matrix(c(3,5),nrow=1)))
Here we combine price and predictors into a data.frame so that it will be named the same way as the newdata data.frame. We use the . in the formula to mean "all other columns", so we don't have to specify them explicitly.
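An alternative sketch that avoids the auto-generated predictors.1/predictors.2 column names: name the matrix columns up front (x1 and x2 here are illustrative names, not from the question):
price <- c(10, 18, 18, 11, 17)
predictors <- cbind(x1 = c(5, 6, 3, 4, 5), x2 = c(2, 1, 8, 5, 6))
indata <- data.frame(price, predictors)  # columns: price, x1, x2
fit <- lm(price ~ ., data = indata)
predict(fit, newdata = data.frame(x1 = 3, x2 = 5))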
You need to build the model first, then predict from it:
mod1 <- lm(price ~ predictor1 + predictor2)
predict(mod1, data.frame(predictor1 = 3, predictor2 = 5))
