How to deal with NA in a panel data regression? - r

I am trying to predict fitted values over data containing NAs, and based on a model generated by plm. Here's some sample code:
require(plm)
test.data <- data.frame(id=c(1,1,2,2,3), time=c(1,2,1,2,1),
y=c(1,3,5,10,8), x=c(1, NA, 3,4,5))
model <- plm(y ~ x, data=test.data, index=c("id", "time"),
model="pooling", na.action=na.exclude)
yhat <- predict(model, test.data, na.action=na.pass)
test.data$yhat <- yhat
When I run the last line I get an error stating that the replacement has 4 rows while data has 5 rows.
I have no idea how to get predict to return a vector of length 5.
If instead of running a plm I run an lm (as in the line below) I get the expected result.
model <- lm(y ~ x, data=test.data, na.action=na.exclude)

As of version 2.6.2 of plm (2022-08-16), this should work out of the box (see "Predict out of sample on fixed effects model"). From the NEWS file:
prediction implemented for fixed effects models incl. support for argument newdata and out-of-sample prediction. Help page (?predict.plm) added to specifically explain the prediction for fixed effects models and the out-of-sample case.
I think this is something that predict.plm ought to handle for you -- seems like an oversight on the package authors' part -- but you can use ?napredict to implement it for yourself:
pp <- predict(model, test.data)
na.stuff <- attr(model$model,"na.action")
(yhat <- napredict(na.stuff,pp))
## [1] 1.371429 NA 5.485714 7.542857 9.600000
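To see what napredict does in isolation, here is a minimal sketch that uses only base R. It swaps plm for lm (a pooling model on this data gives the same coefficients) and pads the four complete-case fitted values back to five rows. The names d, fit, na_info, short and padded are made up for this illustration.

```r
# Same toy data as in the question, without the panel index
d <- data.frame(y = c(1, 3, 5, 10, 8), x = c(1, NA, 3, 4, 5))

# na.exclude records which rows were dropped (here row 2)
fit <- lm(y ~ x, data = d, na.action = na.exclude)
na_info <- fit$na.action

# The stored fitted values cover only the 4 complete rows
short <- unname(fit$fitted.values)

# napredict() re-inserts NA at the positions of the dropped rows
padded <- napredict(na_info, short)
padded
```

The padding only happens when the na.action object has class "exclude"; with na.omit, napredict returns its input unchanged.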

Related

Why no t-scores or p-values from summary(glm) in Databricks?

I'm using Databricks with the SparkR package to build a glm model. Everything seems to run ok except when I run summary(lm1). Instead of getting Variable, Estimate, Std.Error, t-value & p-value (see pic below - this is what I'd expect to see, NOT what I'm getting), I just get the variable and estimate. The only thing I can think is that the data set is big enough (train1 is 12 million rows and test1 is 6 million rows) that all estimates have 0 p-values. Any other reasons this would happen??
library(SparkR)
rdf <- sql("select * from myTable") #read data
train1 <- rdf[rdf$ntile_3 != 1,] # split into test and train based on ntile in table
test1 <- rdf[rdf$ntile_3 == 1,]
vtu1 <- c('var1','var2','var3')
lm1 <- glm( target ~., train1[,c(vtu1,'target' )],family = 'gaussian')
pred1 <- predict(lm1, test1)
summary(lm1)
Since you specify family = 'gaussian' in your model, your glm model should be equivalent to a standard linear regression model (fitted with lm in R).
For an extensive answer to your question, see for example here: https://stats.stackexchange.com/questions/187100/interpreting-glm-model-output-assessing-quality-of-fit
If you specify your model using lm, you should get the output you want.
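The equivalence can be checked with a small sketch in base R (the stats package, not SparkR, whose glm and summary are separate implementations and may print differently). It uses the built-in mtcars data purely as a stand-in; fit_glm and fit_lm are illustrative names.

```r
# Gaussian GLM with the default identity link ...
fit_glm <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian)

# ... and the corresponding ordinary linear regression
fit_lm <- lm(mpg ~ wt + hp, data = mtcars)

# The coefficient estimates agree up to numerical tolerance
same_coefs <- isTRUE(all.equal(coef(fit_glm), coef(fit_lm)))
same_coefs

# The lm summary carries estimate, std. error, t value and p-value columns
coef_table <- coef(summary(fit_lm))
colnames(coef_table)
```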

Probability predictions with model averaged Cumulative Link Mixed Models fitted with clmm in ordinal package

I found that the predict function is currently not implemented for cumulative link mixed models fitted using the clmm function in the ordinal R package. While predict is implemented for clmm2 in the same package, I chose to apply clmm instead because the latter allows for more than one random effect. Further, I also fitted several clmm models and performed model averaging using the model.avg function in the MuMIn package. Ideally, I want to predict probabilities using the averaged model. However, while MuMIn supports clmm models, predict does not work with the averaged model either.
Is there a way to hack the predict function so that the function not only could predict probabilities from a clmm model, but also predict using model averaged coefficients from clmm (i.e. object of class "averaging")? For example:
require(ordinal)
require(MuMIn)
mm1 <- clmm(SURENESS ~ PROD + (1|RESP) + (1|RESP:PROD), data = soup,
link = "probit", threshold = "equidistant")
## test random effect:
mm2 <- clmm(SURENESS ~ PROD + (1|RESP) + (1|RESP:PROD), data = soup,
link = "logistic", threshold = "equidistant")
#create a model selection object
mm.sel<-model.sel(mm1,mm2)
##perform a model average
mm.avg<-model.avg(mm.sel)
#create new data and predict
new.data<-soup
##predict with individual model
predict(mm1, new.data)
I got the following error message:
In UseMethod("predict") :
no applicable method for predict applied to an object of class "clmm"
##predict with model average
predict(mm.avg, new.data)
Another error is returned:
Error in predict.averaging(mm.avg, new.data) :
predict for models 'mm1' and 'mm2' caused errors
I've been using clmm as well and yes I confirm predict.clmm is NOT (yet?) implemented. I didn't yet check the source code for fake.predict.clmm. It might work. If it doesn't, you're stuck with doing stuff by hand or using predict.clmm2.
I found a potential solution (pasted below) but have not been able to make it work for my data.
Solution here: https://gist.github.com/mainambui/c803aaf857e54a5c9089ea05f91473bc
I think the problem is the number of coefficients I am using but am not experienced enough to figure it out. Hopefully this helps someone out though.
This is the model and newdata that I am using, though it is actually a model averaged version. Same predictors though.
ma10 <- clmm(Location3 ~ Sex * Grass3 + Sex * Forb3 + (1|Tag_ID), data =
IP_all_dunes)
ma_1 <- model.avg(ma10, ma8, ma5)##top 3 models
new_ma<- data.frame(Sex = c("m","f","m","f","m","f","m","f"),
Grass3 = c("1","1","1","1","0","0","0","0"),
Forb3 = c("0","0","1","1","0","0","1","1"))
# Arguments:
# - model = a clmm model
# - modelAvg = a clmm model average (object of class averaging)
# - newdata = a dataframe of new data to apply the model to
# Returns a dataframe of predicted probabilities for each row and response level
fake.predict.clmm <- function(modelAvg, newdata) {
# Actual prediction function
pred <- function(eta, theta, cat = 1:(length(theta) + 1), inv.link = plogis) {
Theta <- c(-1000, theta, 1000)
sapply(cat, function(j) inv.link(Theta[j + 1] - eta) - inv.link(Theta[j] -
eta))
}
# Multiply each row by the coefficients
#coefs <- c(model$beta, unlist(model$ST))##turn off if a model average is used
beta <- modelAvg$coefficients[2,3:12]
coefs <- c(beta, unlist(modelAvg$ST))
xbetas <- sweep(newdata, MARGIN=2, coefs, `*`)
# Make predictions
Theta<-modelAvg$coefficients[2,1:2]
#pred.mat <- data.frame(pred(eta=rowSums(xbetas), theta=model$Theta))
pred.mat <- data.frame(pred(eta=rowSums(xbetas), theta=Theta))
#colnames(pred.mat) <- levels(model$model[,1])
a<-attr(modelAvg, "modelList")
colnames(pred.mat) <- levels(a[[1]]$model[,1])
pred.mat
}
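The core of the function above is the inner pred step, which turns a linear predictor and a set of thresholds into category probabilities: P(Y = j) = F(theta_j - eta) - F(theta_{j-1} - eta). Below is a minimal base R sketch of just that step, using infinite outer cut points instead of the -1000 and 1000 placeholders. It ignores random effects, and the function name cat_probs and the numeric values of theta and eta are hypothetical, not taken from any fitted model.

```r
# Category probabilities of a cumulative link model for given
# linear predictors (eta) and ordered thresholds (theta).
cat_probs <- function(eta, theta, inv_link = plogis) {
  cuts <- c(-Inf, theta, Inf)
  # Cumulative probabilities F(cut - eta): one row per eta, one column per cut
  cum <- outer(eta, cuts, function(e, t) inv_link(t - e))
  # Differences of adjacent cumulative probabilities give the category probabilities
  probs <- cum[, -1, drop = FALSE] - cum[, -length(cuts), drop = FALSE]
  colnames(probs) <- paste0("cat", seq_len(ncol(probs)))
  probs
}

theta <- c(-1, 0.5, 2)   # hypothetical thresholds (4 response categories)
eta <- c(-0.5, 0, 1.2)   # hypothetical linear predictors for 3 rows
p <- cat_probs(eta, theta)
p
rowSums(p)               # each row should sum to 1
```

For a probit link, pass inv_link = pnorm. Note that eta must be built from a numeric design matrix (for example via model.matrix) whose columns line up with the coefficients; multiplying a data frame of character columns by coefficients will not work.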

How to get the corr(u_i, Xb) for panel data fixed effects regression in R

I am trying to develop a fixed effect regression model for a panel data using the plm package in R. I want to get the correlation between fixed effects and the regressors. Something like the corr(u_i, Xb) that comes in the Stata output.
How to get it in R?
I have tried the following (using the in-built dataset in the plm package):-
data("Grunfeld", package = "plm")
library(plm)
# build the model
gi <- plm(inv ~ value + capital, data = Grunfeld, model = "within")
# extract the fixed effects fixef(gi)
summary(fixef(gi))
fixefs <- fixef(gi)[index(gi, which = "id")] ## get the fixed effects
newdata <- as.data.frame(cbind(fixefs, Grunfeld$value, Grunfeld$capital))
colnames(newdata) <- c("fixed_effects", "value", "capital")
cor(newdata)
EDIT: I asked this question on Cross Validated first and got this reply: "Questions that are solely about programming or carrying out an operation within a statistical package are off-topic for this site and may be closed." Since my question has more to do with an operation in a package, I guess this is the right place!
How about the following considering functions of plm:
# Run the model
gi <- plm(inv ~ value + capital, data = Grunfeld, model = "within")
# Get the residuals (res) and fixed effects (fix)
res = residuals(gi)
fix = fixef(gi)
# Aggregate residuals and fixed effects
newdata = cbind(res, fix)
# Correlation
cor(newdata)
res fix
res 1.00000000 0.05171279
fix 0.05171279 1.00000000
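For reference, a minimal base R sketch of the quantity the question asks about, assuming Stata's corr(u_i, Xb) means the correlation, taken over all observations, between each unit's estimated fixed effect and the fitted linear combination x_it * b. It uses a simulated one-regressor panel rather than Grunfeld, and a hand-rolled within estimator rather than plm; every name in it is made up for this illustration.

```r
set.seed(1)
n_id <- 10
n_t <- 20
id <- rep(seq_len(n_id), each = n_t)

alpha <- rnorm(n_id)                    # simulated unit effects
x <- alpha[id] + rnorm(n_id * n_t)      # regressor correlated with the effects
y <- 2 * x + alpha[id] + rnorm(n_id * n_t)

# Within estimator: demean by unit, then OLS through the origin
x_w <- x - ave(x, id)
y_w <- y - ave(y, id)
b <- sum(x_w * y_w) / sum(x_w^2)

# Estimated unit effect, repeated so there is one value per observation
u_i <- ave(y, id) - b * ave(x, id)

# Linear prediction from the regressors only
xb <- b * x

corr_u_xb <- cor(u_i, xb)
corr_u_xb
```

The important detail is that the fixed effects are expanded to one value per observation, aligned by unit, before calling cor. Binding a vector with one entry per unit to a vector with one entry per observation would recycle the shorter one and misalign it.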

How to create a confusion matrix for a decision tree model

I am having some difficulties creating a confusion matrix to compare my model prediction to the actual values. My data set has 159 explanatory variables and my target is called "classe".
library(caret) # createDataPartition, confusionMatrix
library(rpart) # rpart
#Load Data
df <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", na.strings=c("NA","#DIV/0!",""))
#Split into training and validation
index <- createDataPartition(df$classe, times=1, p=0.5)[[1]]
training <- df[index, ]
validation <- df[-index, ]
#Model
decisionTreeModel <- rpart(classe ~ ., data=training, method="class", cp =0.5)
#Predict
pred1 <- predict(decisionTreeModel, validation)
#Check model performance
confusionMatrix(validation$classe, pred1)
The following error message is generated from the code above:
Error in confusionMatrix.default(validation$classe, pred1) :
The data must contain some levels that overlap the reference.
I think it may have something to do with the pred1 variable that the predict function generates, it's a matrix with 5 columns while validation$classe is a factor with 5 levels. Any ideas on how to solve this?
Thanks in advance
Your prediction is giving you a matrix of probabilities for each class. If you want the "winner" (the predicted class) returned instead, replace your predict line with this:
pred1 <- predict(decisionTreeModel, validation, type="class")
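To illustrate the difference between the two return types, here is a small base R sketch that converts a probability matrix into predicted classes by hand and cross-tabulates them against the true labels. The matrix prob and the vector actual are made-up stand-ins, not output from the rpart model above.

```r
# Hypothetical class probabilities: one row per observation, one column per class
prob <- matrix(c(0.7, 0.2, 0.1,
                 0.1, 0.6, 0.3,
                 0.2, 0.2, 0.6,
                 0.5, 0.4, 0.1),
               ncol = 3, byrow = TRUE,
               dimnames = list(NULL, c("A", "B", "C")))
actual <- factor(c("A", "B", "C", "B"), levels = c("A", "B", "C"))

# Pick the most probable class per row, using the same levels as the truth
pred_class <- factor(colnames(prob)[max.col(prob, ties.method = "first")],
                     levels = levels(actual))

# Plain confusion table: rows are predictions, columns are true classes
cm <- table(Predicted = pred_class, Actual = actual)
cm
```

Giving the predictions the same factor levels as the reference is what avoids the "levels that overlap" error when the result is passed on to a function such as confusionMatrix.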

GLMNET prediction with intercept

I have two questions about prediction using GLMNET - specifically about the intercept.
I made a small example of train data creation, GLMNET estimation and prediction on the train data (which I will later change to Test data):
# Train data creation
Train <- data.frame('x1'=runif(10), 'x2'=runif(10))
Train$y <- Train$x1-Train$x2+runif(10)
# From Train data frame to x and y matrix
y <- Train$y
x <- as.matrix(Train[,c('x1','x2')])
# Glmnet model
Model_El <- glmnet(x,y)
Cv_El <- cv.glmnet(x,y)
# Prediction
Test_Matrix <- model.matrix(~.-y,data=Train)[,-1]
Test_Matrix_Df <- data.frame(Test_Matrix)
Pred_El <- predict(Model_El,newx=Test_Matrix,s=Cv_El$lambda.min,type='response')
I want to have an intercept in the estimated formula. This code gives an error concerning the dimensions of the Test_Matrix matrix unless I remove the (Intercept) column of the matrix - as in
Test_Matrix <- model.matrix(~.-y,data=Train)[,-1]
My questions are:
Is it the right way to do this in order to get the prediction - when I want the prediction formula to include the intercept?
If it is the right way: Why do I have to remove the intercept in the matrix?
Thanks in advance.
The matrix x you were feeding into the glmnet function doesn't contain an intercept column. Therefore, you should respect this format when constructing your test matrix: i.e. just do model.matrix(y ~ . - 1, data = Train).
By default, glmnet fits an intercept (see the intercept parameter of the glmnet function). Therefore, calling glmnet(x, y) is effectively the same as calling glmnet(x, y, intercept = TRUE). Thus, even though your x matrix didn't have an intercept column, an intercept was fit for you.
If you want to predict a model with intercept, you have to fit a model with intercept. Your code used model matrix x <- as.matrix(Train[,c('x1','x2')]) which is intercept-free, therefore if you provide an intercept when using predict, you get an error.
You can do the following:
x <- model.matrix(y ~ ., Train) ## model matrix with intercept
Model_El <- glmnet(x,y)
Cv_El <- cv.glmnet(x,y)
Test_Matrix <- model.matrix(y ~ ., Train) ## prediction matrix with intercept
Pred_El <- predict(Model_El, newx = Test_Matrix, s = Cv_El$lambda.min, type='response')
Note, you don't have to do
model.matrix(~ . -y)
model.matrix will ignore the LHS of the formula, so it is legitimate to use
model.matrix(y ~ .)
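The column layouts discussed in these answers can be inspected without glmnet. The sketch below, in base R only, rebuilds the toy data (the seed is added here for reproducibility) and compares the design matrix with and without the "(Intercept)" column; with_int and no_int are illustrative names.

```r
set.seed(42)
Train <- data.frame(x1 = runif(10), x2 = runif(10))
Train$y <- Train$x1 - Train$x2 + runif(10)

# The response on the left-hand side is not turned into a column
with_int <- model.matrix(y ~ ., data = Train)
colnames(with_int)

# Dropping the first column leaves only the predictors,
# matching as.matrix(Train[, c("x1", "x2")])
no_int <- with_int[, -1]
colnames(no_int)

same_as_x <- isTRUE(all.equal(unname(no_int),
                              unname(as.matrix(Train[, c("x1", "x2")]))))
same_as_x
```

Whichever layout is used for fitting, the matrix passed as newx to predict needs the same columns in the same order.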
