I have a small-N, large-T panel which I am estimating via plm::plm (panel linear regression model) with fixed effects.
Is there any way to get predicted values for a new dataset? (I want to
estimate parameters on a subset of my sample, and then use these to
calculate model-implied values for the whole sample).
There are (at least) two methods in the package to produce estimates from plm objects:
-- fixef.plm: Extract the Fixed Effects
-- pmodel.response: A function to extract the model.response
It appears to me that the author(s) are not interested in providing estimates for the "random effects". It may be a matter of "if you don't know how to do it on your own, then we don't want to give you a sharp knife to cut yourself too deeply."
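Still, for a fixed-effects (within) model you can rebuild fitted values yourself from what the package does expose. A minimal, hedged sketch on plm's built-in Grunfeld data (the example objects are mine, not from the question):
library(plm)
data("Grunfeld", package = "plm")
fe <- plm(inv ~ value + capital, data = Grunfeld, model = "within")
# fitted values of the within-transformed model: transformed response minus residuals
fit_within <- pmodel.response(fe) - residuals(fe)
# level fitted values: add back each firm's mean outcome,
# since y_hat_it = (X_it - Xbar_i) * beta + ybar_i
# (fixef(fe) would give the estimated individual intercepts directly)
y_bar <- tapply(Grunfeld$inv, Grunfeld$firm, mean)
fit_level <- fit_within + y_bar[as.character(index(fe)[[1]])]
head(fit_level)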
I wrote a function called predict.out.plm that can create predictions for the original data and for a manipulated data set (with the same column names).
predict.out.plm calculates a) the predicted (fitted) outcome of the transformed data and b) constructs the corresponding level outcome. The function works for First Difference (FD) and Fixed Effects (FE) estimations done with plm. For FD it creates the differenced outcome over time, and for FE it creates the time-demeaned outcome.
The function is largely untested and probably only works with strongly balanced data frames.
Any suggestions and corrections are very welcome. Help with developing a small R package would be much appreciated.
The function predict.out.plm
predict.out.plm<-function(
estimate,
formula,
data,
model="fd",
pname="y",
pindex=NULL,
levelconstr=T
){
# estimate=e.fe
# formula=f
# data=d
# model="within"
# pname="y"
# pindex=NULL
# levelconstr=T
#get index of panel data (respect a user-supplied pindex)
if (is.null(pindex)) {
if (class(data)[1]=="pdata.frame") {
pindex<-names(attributes(data)$index)
} else {
pindex<-names(data)[1:2]
}
}
if (class(data)[1]!="pdata.frame") {
data<-pdata.frame(data)
}
#model frame
mf<-model.frame(formula,data=data)
#model matrix - transformed data
mn<-model.matrix(formula,mf,model)
#define variable names
y.t.hat<-paste0(pname,".t.hat")
y.l.hat<-paste0(pname,".l.hat")
y.l<-names(mf)[1]
#transformed data of explanatory variables
#exclude variables that were dropped in estimation
n<-names(estimate$aliased[estimate$aliased==F])
i<-match(n,colnames(mn))
X<-mn[,i]
#predict transformed outcome with X * beta
# p<- X %*% coef(estimate)
p<-crossprod(t(X),coef(estimate))
colnames(p)<-y.t.hat
if (levelconstr==T){
#old dataset with original outcome
od<-data.frame(
attributes(mf)$index,
data.frame(mf)[,1]
)
rownames(od)<-rownames(mf) #preserve row names from model.frame
names(od)[3]<-y.l
#merge old dataset with prediction
nd<-merge(
od,
p,
by="row.names",
all.x=T,
sort=F
)
nd$Row.names<-as.integer(nd$Row.names)
nd<-nd[order(nd$Row.names),]
#construct predicted level outcome for FD estimations
if (model=="fd"){
#first observation from real data
i<-which(is.na(nd[,y.t.hat]))
nd[i,y.l.hat]<-NA
nd[i,y.l.hat]<-nd[i,y.l]
#fill values over all years
ylist<-unique(nd[,pindex[2]])[-1]
ylist<-as.integer(as.character(ylist))
for (y in ylist){
nd[nd[,pindex[2]]==y,y.l.hat]<-
nd[nd[,pindex[2]]==(y-1),y.l.hat] +
nd[nd[,pindex[2]]==y,y.t.hat]
}
}
if (model=="within"){
#group means of outcome
gm<-aggregate(nd[, pname], list(nd[,pindex[1]]), mean)
gl<-aggregate(nd[, pname], list(nd[,pindex[1]]), length)
nd<-cbind(nd,groupmeans=rep(gm$x,gl$x))
#predicted values + group means
nd[,y.l.hat]<-nd[,y.t.hat] + nd[,"groupmeans"]
}
if (model!="fd" && model!="within") {
stop('function works only for FD and FE estimations')
}
}
#results
results<-p
if (levelconstr==T){
results<-list(results,nd)
names(results)<-c("p","df")
}
return(results)
}
Testing the function:
##packages
library(plm)
##test dataframe
#data structure
N<-4
G<-2
M<-5
d<-data.frame(
id=rep(1:N,each=M),
year=rep(1:M,N)+2000,
gid=rep(1:G,each=M*2)
)
#explanatory variable
d[,"x"]=runif(N*M,0,1)
#outcome
d[,"y"] = 2 * d[,"x"] + runif(N*M,0,1)
#panel data frame
d<-pdata.frame(d,index=c("id","year"))
##new data frame for out of sample prediction
dn<-d
dn$x<-rnorm(nrow(dn),0,2)
##estimate
#formula
f<- pFormula(y ~ x + factor(year))
#fixed effects or first difference estimation
e<-plm(f,data=d,model="within",index=c("id","year"))
e<-plm(f,data=d,model="fd",index=c("id","year"))
summary(e)
##fitted values of estimation
#transformed outcome prediction
predict(e)
c(pmodel.response(e)-residuals(e))
predict.out.plm(e,f,d,"fd")$p
# "level" outcome prediciton
predict.out.plm(e,f,d,"fd")$df$y.l.hat
#both
predict.out.plm(e,f,d,"fd")
##out-of-sample prediction
predict(e,newdata=d)
predict(e,newdata=dn)
# Error in crossprod(beta, t(X)) : non-conformable arguments
# if plm drops variables specified in the formula (e.g. one year in factor(year)),
# predict() tries to multiply matrices whose number of columns differs from the number of estimated coefficients
# the new function avoids this and is therefore able to do out-of-sample predictions
predict.out.plm(e,f,dn,"fd")
plm now has a predict.plm() method, although it is not documented or exported.
Note also that predict() works on the transformed model (i.e. after the within/between/fd transformation), not the original one. I speculate that the reason for this is that prediction is more difficult in a panel data framework. Indeed, you need to consider whether you are predicting:
-- new time periods for an existing individual, with individual fixed effects? Then you can add the prediction to the existing individual mean (see the sketch after this list).
-- new time periods for a new individual? Then you need to figure out which individual mean you are going to use.
-- the same is even more complicated if you use a random-effects model, as the effects are not easily derived.
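For the first case, a rough sketch of the idea on plm's Grunfeld data (the new regressor values are made up for illustration):
library(plm)
data("Grunfeld", package = "plm")
fe <- plm(inv ~ value + capital, data = Grunfeld, model = "within")
# level prediction for firm "1" in a new period:
# X_new * beta plus that firm's estimated intercept from fixef()
x_new <- c(value = 5000, capital = 2000)  # hypothetical new values
sum(coef(fe) * x_new) + fixef(fe)["1"]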
In the code below, I illustrate how to use fitted values, on the existing sample:
library(plm)
#> Loading required package: Formula
library(tidyverse)
data("Produc", package = "plm")
zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
data = Produc, index = c("state","year"))
## produce a dataset of prediction, added to the group means
Produc_means <- Produc %>%
mutate(y = log(gsp)) %>%
group_by(state) %>%
transmute(y_mean = mean(y),
y = y,
year = year) %>%
ungroup() %>%
mutate(y_pred = predict(zz) + y_mean) %>%
select(-y_mean)
## plot it
Produc_means %>%
gather(type, value, y, y_pred) %>%
filter(state %in% toupper(state.name[1:5])) %>%
ggplot(aes(x = year, y = value, linetype = type))+
geom_line() +
facet_wrap(~state) +
ggtitle("Visualising in-sample prediction, for 4 states")
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
Created on 2018-11-20 by the reprex package (v0.2.1)
Looks like there is a new package, prediction, that does in-sample predictions for a variety of models, including plm:
https://cran.r-project.org/web/packages/prediction/prediction.pdf
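A hedged sketch of how that could look (not verified against the current package version; prediction() is the package's main generic and should dispatch to, or fall back on, the model's predict method):
library(prediction)
library(plm)
data("Produc", package = "plm")
zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
data = Produc, index = c("state", "year"))
head(prediction(zz))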
You can calculate the residuals via residuals(reg_name). From here, you can subtract them from your response variable and get the predicted values.
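In plm terms that is a one-liner (a sketch; reg_name is a placeholder for your fitted model, and for plm objects the relevant response is the transformed one from pmodel.response()):
# fitted values of the (transformed) model: response minus residuals
fitted_vals <- pmodel.response(reg_name) - residuals(reg_name)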
I have three questions about the sample code below which illustrates the use of step_lencode_mixed.
I read in the vignette that: "For each factor predictor, a generalized linear model is fit to the outcome and the coefficients are returned as the encoding."
In the output from the example below, the column 'partial' is the return value from step_lencode_mixed. My questions:
Should I use this partial as the encoded categorical variable "where_town" in the new model to be fitted?
Is there a complete model (Class ~ ., data = okc_train) with all variables on Class fitted in the background, and is the contribution from the variable "where_town" returned as partial?
If I convert the partial with the logit2prob function, I notice that the outcome is almost identical to the rate. For that reason, I suppose the outcome is not a coefficient?
Thanks a lot!
# ------------------------------------------------------------------------------
# Feature Engineering and Selection: A Practical Approach for Predictive Models
# by Max Kuhn and Kjell Johnson
#
# ------------------------------------------------------------------------------
#
# Code for Section 5.4 at
# https://bookdown.org/max/FES/categorical-supervised-encoding.html
#
# ------------------------------------------------------------------------------
#
# Code requires these packages:
library(tidymodels)
library(embed)
# Create example data ----------------------------------------------------------
load("../Data_Sets/OkCupid/okc.RData")
load("../Data_Sets/OkCupid/okc_binary.RData")
options(width = 120)
partial_rec <-
recipe(Class ~ ., data = okc_train) %>%
step_lencode_mixed(
where_town,
outcome = vars(Class)
) %>%
prep()
okc_train2 <- okc_train %>% select(where_town, Class)
partial_rec2 <-
recipe(Class ~ ., data = okc_train2) %>%
step_lencode_mixed(
where_town,
outcome = vars(Class)
) %>%
prep()
# Organize results -------------------------------------------------------------
partial_pooled <-
tidy(partial_rec, number = 1) %>%
dplyr::select(-terms, -id) %>%
setNames(c("where_town", "partial"))
partial_pooled <- left_join(partial_pooled, okc_props)
partial_pooled2 <-
tidy(partial_rec2, number = 1) %>%
dplyr::select(-terms, -id) %>%
setNames(c("where_town", "partial"))
all.equal(partial_pooled, partial_pooled2)
#> [1] TRUE
Should I use this partial as the encoded categorical variable "where_town" in the new model to be fitted?
Yes. You don't have to do it manually though; the bake() function does that for you automatically (the same as if you include the recipe in a workflow).
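For example (a sketch; okc_train comes from the .RData files loaded in the question):
# bake() applies the learned encoding, replacing where_town by its numeric score
encoded <- bake(partial_rec, new_data = okc_train)
head(encoded$where_town)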
Is there a complete model (Class ~ ., data = okc_train) with all variables on Class fitted in the background and is the contribution from variable "where_town" returned as partial?
Yes. There is more information in the tidymodels book (section 17.3).
If I convert the partial with the logit2prob function, I notice that the outcome is almost identical to the rate. For that reason, I suppose the outcome is not a coefficient?
A simpler method to do the conversion to the rate is binomial()$linkinv(partial_pooled$partial).
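Both conversions agree, since logit2prob (assumed here to be the usual inverse-logit helper) and the binomial link inverse are the same function:
logit2prob <- function(x) exp(x) / (1 + exp(x))
all.equal(logit2prob(-1.5), binomial()$linkinv(-1.5))
#> [1] TRUE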
The value given in the partial column is the log-odds value (hence the negative numbers); we use a logistic regression (mixed model) to estimate it. It uses an empirical Bayes estimation method that shrinks the coefficient estimates toward the overall (population) estimate.
The amount of shrinkage, for this model, is based on a few things but is mostly driven by the per-category sample size. Smaller categories are shrunk more than categories with larger amounts of data. So the raw and shrunken estimates for berkeley are about the same, since there were 2676 data points there, but belvedere_tiburon shows larger differences between the estimates because its sample size was only 35.
I appreciate broom.mixed's ability to capture mixed-effects models in nice tidy formats. In assessing assumptions for a linear mixed-effects model, I am finding the augment function particularly useful. However, the documentation fails to state what all the columns from augment.merMod() are.
library(lme4)
library(broom.mixed)
set.seed(101)
dd <- expand.grid(f1 = factor(1:3),
f2 = LETTERS[1:2], g=1:9, rep=1:15,
KEEP.OUT.ATTRS=FALSE)
summary(mu <- 5*(-4 + with(dd, as.integer(f1) + 4*as.numeric(f2))))
dd$y <- rnbinom(nrow(dd), mu = mu, size = 0.5)
m.nb <- glmer.nb(y ~ f1*f2 + (1|g), data=dd, verbose=FALSE)
head(augment(m.nb))
Here is what the documentation says:
augment returns one row for each original observation, with columns (each prepended by a .) added. Included are the columns
.fitted predicted values
.resid residuals
.fixed predicted values with no random effects
Also added for "merMod" objects, but not for "mer" objects, are values from the response object within the model (of type lmResp, glmResp, nlsResp, etc). These include ".mu", ".offset", ".sqrtXwt", ".sqrtrwt", ".eta".
What are these columns: ".mu", ".sqrtXwt", ".sqrtrwt", ".eta"? Is .fitted the predicted values on the model scale, and .mu the same on the response scale (in other words, with the inverse link function applied to the predicted values)?
My confusion matrix created for a logistic regression model only has values for Predicted = FALSE. Even though I adjusted my threshold, it does not change the matrix much. What is wrong, and how do I adjust the threshold? Below is the code for the training set and the result. "retain" is my dependent variable (1 = retained, 0 = not retained), and all the independent variables are continuous. I have 170K records overall in the dataset (df). This matrix indicates that the model predicts that no one retained, which is odd, because in reality 45% retained.
model_1 <- glm(retain~ age_2010+cnt_total_funds+sum_MS_2010+tenure_2010, data=df, family="binomial")
res <- predict(model_1, training, retain="response")
(table(ActualValue=training$retain, PredictedValue=res>0.05))
PredictedValue
ActualValue FALSE
0 96006
1 43676
You made a mistake inside the predict() function: you want to use the type argument (retain does not exist for this function).
I use a sample data set to show you a working example.
In your example, change retain="response" to type="response".
aa <- airquality
aa$retain <- aa$Ozone > 50
gg = glm(retain ~ Solar.R + Month, data = aa, family = "binomial")
range(predict(gg, aa, type = "response"), na.rm = TRUE)
#> [1] 0.05918388 0.48769632
Created on 2021-06-18 by the reprex package (v2.0.0)
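Applied to your own code, the corrected call would be (a sketch; model_1 and training are the objects from your question):
res <- predict(model_1, training, type = "response")
table(ActualValue = training$retain, PredictedValue = res > 0.05)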
I am using lapply to perform several glm regressions on one dependent variable, one independent variable at a time, but I'm not sure how to extract all the p-values at once.
There are 200 features in my dataset, but the code below only gave me the p-value of feature #1. How can I get a matrix of all p-values for the 200 features?
valName<- as.data.frame(colnames(repeatData))
featureName<-valName[3,]
lapply(featureName,
function(var) {
formula <- as.formula(paste("outcome ~", var))
fit.logist <- glm(formula, data = repeatData, family = binomial)
summary(fit.logist)
Pvalue<-coef(summary(fit.logist))[,'Pr(>|z|)']
})
I simplified your code a little bit: (1) used reformulate() (not really different, just prettier); (2) returned only the p-value for the focal variable, not the intercept p-value. (If you leave out the 2 in the indexing, you'll get a two-row matrix with the intercept and focal-variable p-values.)
My example uses the built-in mtcars data set, with an added (fake) binomial response.
repeatData <- data.frame(outcome=rbinom(nrow(mtcars), size=1, prob=0.5), mtcars)
ff <- function(var) {
formula <- reformulate(var, response="outcome")
fit.logist <- glm(formula, data = repeatData, family = binomial)
coef(summary(fit.logist))[2, 'Pr(>|z|)']
}
## skip first column (response variable).
sapply(names(repeatData)[-1], ff)
My goal is to create multiple models and then using a novel dataset, create prediction values for that new data set and the corresponding prediction intervals around each of those new fitted points.
Pulling in libraries:
library(purrr)
library(dplyr)
library(modelr)
Assigning data_1 as the DNase data set from R:
data_1 <- DNase
Creating one unique model for each run:
model_dna <- data_1 %>% group_by(Run) %>%
do(model_dna = lm(conc ~ density, data = .)) %>% ungroup()
I then want to predict a set of points with a new data set, let's call it data_2, for each model, and then build prediction intervals around each fitted point (the upper and lower bounds of the prediction interval for each point, as produced by predict() when the argument interval = "prediction" is included). I successfully generated fitted values like this:
data_2 <- map(model_dna$model_dna, ~ spread_predictions(data = data_2, models = .x))
But I then struggle to add the "upr" and "lwr" columns for these newly fitted values. Is there a way to simultaneously "spread_prediction_intervals" when fitting these new points? It would also be very helpful to understand how to do this for multiple data sets, and, given the model that was used to generate predicted values and a set of those predicted values, how to produce the upr and lwr bounds of the prediction interval. Thank you very much for your help in advance.
Apparently you can construct the confidence intervals yourself using the results of predict.
data_2 <- map(model_dna$model_dna, function(x) {
preds=predict(x, data_1, se.fit=TRUE)
mutate(data_1, fit=preds$fit, lwr=fit-preds$se.fit*1.96, upr=fit+preds$se.fit*1.96)
})
If you don't care about the confidence intervals, you can use map with add_predictions, or use spread_predictions to create one big data frame (below, mods stands for the list of fitted models, i.e. model_dna$model_dna).
data_2 <- map(model_dna$model_dna, ~ add_predictions(data = data_1, model = .x))
data_2=spread_predictions(data_1, mods[[1]], mods[[2]], mods[[3]], mods[[4]], mods[[5]], mods[[6]],
mods[[7]], mods[[8]], mods[[9]], mods[[10]], mods[[11]])
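Since each stored model is an ordinary lm, predict() can also return the prediction interval the question asks for directly; a sketch using the objects defined above:
data_2_pi <- map(model_dna$model_dna, function(x) {
preds <- predict(x, newdata = data_1, interval = "prediction")  # columns fit, lwr, upr
cbind(data_1, as.data.frame(preds))
})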