Calculating ROC for panel data and Linear Probability Model - r

I have panel data from external assets of 102 countries over ~ 20-40 years, depending on the country.
I tried predicting the probability of a financial crisis from log(total_liabilities) to see whether an increase in foreign investment and other capital positions can help predict a crisis.
plm1 <- plm(crisis ~ log_total_liabilities + lag1_log_tot_lia + lag2_log_tot_lia + lag3_log_tot_lia
+ factor(year) + factor(country), data = dt2, index=c("year", "country"), model="pooling")
summary(plm1)
I started by estimating a plm model, regressing on my crisis dummy.
To assess the predictive ability, I wanted to generate a ROC curve and AUC value from the regression:
# Plot of True Positive Rate Against the False Positive Rate
pred1 <- predict(plm1)
pred2 <- prediction(pred1,as.numeric(plm1$crisis))
plot(performance(pred2,"tpr","fpr"), las=0, main="plm1")
I get errors like:
"Error: not fitting arguments / variables" (translated from German) or
"all arguments/variables need to have the same length" (translated from German).
Another approach to obtaining ROC values would start with pred1 <- predict(plm1, dt2) (dt2 is my data frame, which also contains some variables I did not use in the plm1 regression). With that change, the error differs:
The format of predictions is invalid. It couldn't be coerced to a list.
Are PLMs simply not made for ROC calculations? And if so, how come that the paper attached presents AUROC values for a linear probability model with fixed effects? (See second last row)
And if not, what am I doing wrong?
I attached the screenshot of the paper and my dataset.
CSV file with dataset
Screenshot of paper with OLS AUROC value

AUC-ROC is defined only for binary classification problems, so the outcome labels must be binary. As you used a fixed effects regression, the predicted values produced from plm1, pred1, are continuous rather than class labels.
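A ROC curve needs binary labels plus a numeric score, so continuous fitted values from a linear probability model can serve as the score as long as both vectors have the same length. Below is a minimal sketch on simulated data. The data frame `sim`, its columns, and the helper `auc_rank` are made up for illustration, and `lm()` stands in for the pooled plm fit. It computes the AUC with the rank (Mann-Whitney) formula instead of ROCR, and it takes the labels from the model frame so that rows dropped because of missing values do not cause a length mismatch.

```r
set.seed(1)
n <- 500
sim <- data.frame(x = rnorm(n))
sim$crisis <- rbinom(n, 1, plogis(-1 + 1.5 * sim$x))
sim$x[sample(n, 20)] <- NA          # missing predictors, as lagged variables create

# Linear probability model: plain OLS on the binary outcome
lpm <- lm(crisis ~ x, data = sim)

score  <- fitted(lpm)               # one value per row actually used in the fit
labels <- model.frame(lpm)$crisis   # outcome for exactly those rows
stopifnot(length(score) == length(labels))

# AUC via the rank formula (hypothetical helper, not part of any package)
auc_rank <- function(score, labels) {
  r  <- rank(score)
  n1 <- sum(labels == 1)
  n0 <- sum(labels == 0)
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc <- auc_rank(score, labels)
auc
```

With ROCR, the same two vectors would be passed as prediction(score, labels). In the question's code the labels come from plm1$crisis, which is probably not a component of the fitted object, so checking length() of both arguments first is a reasonable diagnostic.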

Related

Gamma distribution in a GLMM

I am trying to create a GLMM in R. I want to find out how the emergence time of bats depends on different factors. Here I take the time difference between the departure of the respective bat and the sunset of the day as dependent variable (metric). As fixed factors I would like to include different weather data (metric) as well as the reproductive state (categorical) of the bats. Additionally, there is the transponder number (individual identification code) as a random factor to exclude inter-individual differences between the bats.
I first worked in R with a linear mixed model (package lme4), but the QQ plot of the residuals deviates very strongly from the normal distribution. Also a histogram of the data rather indicates a gamma distribution. As a result, I implemented a GLMM with a gamma distribution. Here is an example with one weather parameter:
model <- glmer(formula = difference_in_min ~ repro + precipitation + (1 + repro | `transponder number`), data = trip, control = ctrl, family = Gamma(link = "log"))
However, since there was no change in the QQ plot this way, I looked at the residual diagnostics of the DHARMa package. But the distribution assumption still doesn't seem to be correct, because the data in the QQ plot deviates very much here, too.
Residual diagnostics from DHARMa
But if the data also do not correspond to a gamma distribution, what alternative is there? Or maybe the problem lies somewhere else entirely.
Does anyone have an idea where the error might lie?
But if the data also do not correspond to a gamma distribution, what alternative is there?
One alternative is called the lognormal distribution (https://en.wikipedia.org/wiki/Log-normal_distribution)
A Gaussian (normal) model is typically used when the response is roughly symmetric around its mean, which does not sound like your data. The lognormal distribution does not have that requirement: it only assumes that the log of the response is normally distributed, so the response must be strictly positive. Following your previous code, you would fit it like this:
model <- glmer(formula = log(difference_in_min) ~ repro + precipitation + (1 + repro | `transponder number`), data = trip, control = ctrl, family = gaussian(link = "identity"))
or, instead of glmer, you can just call lmer directly, where you don't need to specify the distribution (which glmer may tell you to do in a warning message anyway):
model <- lmer(formula = log(difference_in_min) ~ repro + precipitation + (1 + repro | `transponder number`), data = trip, control = ctrl)
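To illustrate what fitting on the log scale implies, here is a small sketch on simulated data. The data frame `toy` and its values are invented, and plain `lm()` stands in for lmer, so the random effects are left out. It checks that the response is strictly positive before taking logs and shows that exponentiating a log-scale prediction gives a predicted median rather than a mean.

```r
set.seed(42)
n <- 200
toy <- data.frame(precipitation = runif(n, 0, 10))
toy$difference_in_min <- exp(3 + 0.05 * toy$precipitation + rnorm(n, sd = 0.4))

# log() is only defined for strictly positive responses
stopifnot(all(toy$difference_in_min > 0))

# Lognormal model: ordinary Gaussian regression on the logged response
fit_log <- lm(log(difference_in_min) ~ precipitation, data = toy)

# Residuals on the log scale should look roughly normal if the assumption holds
if (interactive()) {
  qqnorm(residuals(fit_log))
  qqline(residuals(fit_log))
}

# exp() of a log-scale prediction is the predicted median on the original scale
new_obs <- data.frame(precipitation = c(0, 5, 10))
pred_median <- exp(predict(fit_log, newdata = new_obs))
pred_median
```

If emergence can happen before sunset, the time difference would be zero or negative for some rows, and the log transform would not be applicable without shifting or redefining the response.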

Predicted Survival Curves using Corrected Group Prognosis Method

How can I plot predicted survival curves for a continuous covariate (let's say at the 20th and 80th percentile of its values) using the corrected group prognosis method as implemented in R by Therneau?
For example,
library(survival)
library(survminer)
fit <- coxph( Surv(stop, event) ~ size + strata(rx), data = bladder )
ggadjustedcurves(fit, data=bladder, method = "conditional", strata=rx)
Now, this is useful because I am given two survival curves that are stratified by rx (either 0 or 1) and the conditional method is applied to the bladder data set. However, let's say I would like to use the marginal method without stratifying, and instead plot my continuous covariate at its 20th and 80th percentiles while also re-balancing the subpopulation. I would appreciate any step in the right direction.
To re-state, I have a Cox model with continuous predictors. I would like to build a Cox model that does not stratify on rx but includes it as a covariate. Then, I want to pass the fitted Cox object to the ggadjustedcurves() function, which uses "subpopulation re-balancing" when given a reference data set. And then, instead of showing two survival curves stratified on a categorical variable, I want to plot two representative survival curves at the 20th and 80th percentile.
EDIT
My first attempt
fit2 <- coxph( Surv(stop, event) ~ size + rx, data = bladder ) #remove strata
fit2
# CGP
pred<- data.frame("rx" = 1, "size" = 3.2)
ggadjustedcurves(fit2, data = pred , method = "conditional", reference = bladder)
Is this what I think it is? Conditional re-balancing has been applied to the reference data set and then the predicted curves are generated for an individual with rx=1 and size of 3.2.
It is difficult to understand what you are truly looking for, but I think I have a rough idea. I think you want to plot the survival curve that would have been observed if every person in your sample had received a specific value for the continuous covariate. If there is no confounding, you can simply use a Cox model that includes only the continuous covariate and use the predict() function for a range of points in time and plot the results. If you need to adjust for confounding, you can include the confounders in the Cox model and use g-computation to obtain the desired probabilities. I describe this in a recent preprint: https://arxiv.org/pdf/2208.04644.pdf
This can be done in R using the contsurvplot package (also developed by me). First, install the package using:
devtools::install_github("RobinDenz1/contsurvplot")
Afterwards, fit your Cox model, but use x=TRUE in the coxph call:
library(survival)
library(contsurvplot)
library(riskRegression)
library(ggplot2)
fit2 <- coxph(Surv(stop, event) ~ size + rx, data=bladder, x=TRUE)
You can now call the plot_surv_lines function to obtain the causal survival curves for specific values of size, given the model. Using the horizon argument you can tell the function for which values you want to plot the survival curves. I choose the 20% and 80% quantile of size as you described:
plot_surv_lines(time="stop",
                status="event",
                variable="size",
                data=bladder,
                model=fit2,
                horizon=quantile(bladder$size, probs=c(0.2, 0.8)))
The package contains a lot more plotting routines to visualize the causal effect of a continuous variable on a time-to-event outcome that might be more suitable for what you actually want.
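For readers who want to see the g-computation idea spelled out, here is a hand-rolled sketch that uses only the survival package. It is an approximation of the approach described above, not the contsurvplot implementation: for each target value of size, every row of bladder gets that value, the Cox model predicts a survival curve per row, and the curves are averaged. The object names (`curves`, `counterfactual`) are illustrative, and the repeated rows per patient in bladder are ignored for simplicity.

```r
library(survival)

fit2 <- coxph(Surv(stop, event) ~ size + rx, data = bladder)

# Target values of the continuous covariate: 20% and 80% quantiles
size_values <- quantile(bladder$size, probs = c(0.2, 0.8))

curves <- do.call(rbind, lapply(seq_along(size_values), function(i) {
  # Counterfactual data: everyone gets the same size, other covariates unchanged
  counterfactual <- bladder
  counterfactual$size <- size_values[[i]]
  # One predicted curve per row (columns of sf$surv), then average over rows
  sf <- survfit(fit2, newdata = counterfactual)
  data.frame(quantile = names(size_values)[i],
             time = sf$time,
             surv = rowMeans(sf$surv))
}))

if (interactive()) {
  plot(surv ~ time, data = curves, type = "n", ylim = c(0, 1),
       xlab = "Time", ylab = "Adjusted survival probability")
  for (q in unique(curves$quantile)) {
    lines(surv ~ time, data = curves[curves$quantile == q, ], type = "s")
  }
}
```

Averaging over the observed rows is what re-balances the comparison: both curves are standardized to the same distribution of rx.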

How to get individual coefficients and residuals in panel data using fixed effects

I have panel data including income for individuals over years, and I am interested in the income trends of individuals, i.e. individual coefficients for income over years, and residuals for each individual for each year (the unexpected changes in income according to my model). However, many observations have missing income data for one or more years, so with a linear regression I lose the majority of my observations. The data structure is like this:
caseid<-c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4)
years<-c(1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008,
1998,2000,2002,2004,2006,2008,1998,2000,2002,2004,2006,2008)
income<-c(1100,NA,NA,NA,NA,1300,1500,1900,2000,NA,2200,NA,
NA,NA,NA,NA,NA,NA, 2300,2500,2000,1800,NA, 1900)
df<-data.frame(caseid, years, income)
I decided to use a random effects model, which I think will still predict income for missing years by using a maximum likelihood approach. However, since the Hausman test gives a significant result, I decided to use a fixed effects model instead. I ran the code below, using the plm package:
inc.fe<-plm(income~years, data=df, model="within", effect="individual")
However, I get coefficients only for years and not for individuals; and I cannot get residuals.
To maybe give an idea, the code in Stata should be
xtset caseid
xtreg income year, fe
predict resid, e
Then I tried to run the pvcm function from the same library, which is a function for variable coefficients.
inc.wi<-pvcm(Income~Year, data=ldf, model="within", effect="individual")
However, I get the following error message:
"Error in FUN(X[[i]], ...) : insufficient number of observations".
How can I get individual coefficients and residuals with pvcm by resolving this error or by using some other function?
My original long form data has 202976 observations and 15 years.
Does the fixef function from package plm give you what you are looking for?
Continuing your example:
fixef(inc.fe)
Residuals are extracted by:
residuals(inc.fe)
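As a cross-check that does not depend on plm, the within estimator can be reproduced with lm() and one dummy per individual (the least-squares dummy variable form). The sketch below reuses the example data from the question. The object name `lsdv` is just illustrative, and na.action = na.exclude is used so that the residual vector is padded with NA and lines up with the original rows.

```r
caseid <- rep(1:4, each = 6)
years  <- rep(seq(1998, 2008, by = 2), times = 4)
income <- c(1100, NA, NA, NA, NA, 1300,
            1500, 1900, 2000, NA, 2200, NA,
            NA, NA, NA, NA, NA, NA,
            2300, 2500, 2000, 1800, NA, 1900)
df <- data.frame(caseid, years, income)

# "0 +" drops the global intercept, so each individual gets its own intercept
lsdv <- lm(income ~ 0 + factor(caseid) + years, data = df,
           na.action = na.exclude)

# Individual intercepts plus the common slope on years
coef(lsdv)

# Residuals aligned with the rows of df (NA where income is missing)
df$resid <- residuals(lsdv)
df
```

The individual intercepts should match fixef(inc.fe) and the slope should match the plm coefficient on years. Individual 3 has no observed income at all, so it drops out of the fit and gets neither an intercept nor residuals.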
You have a random effects model with random slopes and intercepts. This is also known as a random coefficients regression model. The missingness is the tricky part, which (I'm guessing) you'll have to write custom code to solve after you choose how you wish to do so.
But you haven't clearly/properly specified your model (at least in your question) as far as I can tell. Let's define some terms:
Let Y_it = income for individual i (i = 1, ..., N) in year t (t = 1, ..., T). As I read your question, you have not specified which of the two models below you wish to have:
M1: random intercepts, global slope, random slopes
Y_it ~ N(\mu_i + B T + \gamma_i I T, \sigma^2)
\mu_i ~ N(\phi_0, \tau_0^2)
\gamma_i ~ N(\phi_1, \tau_1^2)
M2: random intercepts, random slopes
Y_it ~ N(\mu_i + \gamma_i I T, \sigma^2)
\mu_i ~ N(\phi_0, \tau_0^2)
\gamma_i ~ N(\phi_1, \tau_1^2)
Also, your example data is nonsensical (see below). As you can see, you don't have enough observations to estimate all parameters. I'm not familiar with library(plm) but the above models (without missingness) can be estimated in lme4 easily. Without a realistic example dataset, I won't bother providing code.
R> table(df$caseid, is.na(df$income))
    FALSE TRUE
  1     2    4
  2     4    2
  3     0    6
  4     5    1
Given that you do have missingness, you should be able to produce estimates for either hierarchical model via the typical methods, such as EM. But I do think you'll have to write the code to do the estimation yourself.

Error-message when using ClusterSEs package, command cluster.im

I have to adjust a logistic regression model for clustered standard errors. For this purpose I use the package ClusterSEs and the command cluster.im.
I have two levels in the dataset Tbf2: individual and village.
Tbf2 is my small dataset consisting of the variables Burned (binary; village level), Village (factor; village level) and VoteForER2 (binary; individual level).
My code is provided below:
#Make sure the data has the same length,
Tbf1 <- data.frame(cbind(Burned, Village, VoteForER2))
Tbf2 <- na.omit(Tbf1)
#Prediction of support for Authorities on Burned
###ER2 ; logistic regression
fm <- glm(Tbf2$VoteForER2 ~ Tbf2$Burned + Tbf2$, family=binomial(link="logit"))
display(fm)
#Adjusted p-values
clust.p <- cluster.im(fm, Tbf2, Village, ci.level = 0.95, report = T, drop = FALSE)
My problem is, that I keep getting the following error-message:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
And I can't figure out how to solve this. I have two different levels in regression model as far as I can see myself.
I hope somebody will be able to help me!
Best,
Sofie
The cluster.im function works the following way:
"Computes p-values and confidence intervals for GLM models based on cluster-specific model estimation (Ibragimov and Muller 2010). A separate model is estimated in each cluster, and then p-values and confidence intervals are computed based on a t/normal distribution of the cluster-specific estimates."
The model cannot be estimated within each cluster because Burned takes the same value for everyone in a village: the whole village either burned or did not. So the function asks for more variation in the data; the error effectively says "give me at least 2 different levels of the predictor".
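A quick way to confirm this before calling cluster.im is to count how many distinct values the predictor takes inside each cluster. The sketch below uses a made-up data frame `toy` that mimics the structure described in the question (Burned constant within Village). Only base R is used.

```r
set.seed(7)
toy <- data.frame(
  Village = factor(rep(c("A", "B", "C"), each = 20)),
  Burned  = rep(c(1, 0, 1), each = 20)     # village-level: constant within village
)
toy$VoteForER2 <- rbinom(nrow(toy), 1, 0.5)  # individual-level outcome

# Number of distinct Burned values inside each village
levels_per_village <- tapply(toy$Burned, toy$Village,
                             function(x) length(unique(x)))
levels_per_village

# A cluster-specific model needs at least two distinct values per cluster
varies_within <- levels_per_village > 1
any(varies_within)
```

If every entry of levels_per_village is 1, a per-cluster regression on Burned has nothing to estimate, and a method that fits one model over all clusters is the more natural choice.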

glmer - predict with binomial data (cbind count data)

I am trying to predict values over time (Days in x axis) for a glmer model that was run on my binomial data. Total Alive and Total Dead are count data. This is my model, and the corresponding steps below.
full.model.dredge<-glmer(cbind(Total.Alive,Total.Dead)~(CO2.Treatment+Lime.Treatment+Day)^3+(Day|Container)+(1|index),
data=Survival.data,family="binomial")
We have accounted for overdispersion with an observation-level random effect, as you can see in the code: (1|index).
We then use the dredge command to determine the best fitted models with the main effects (CO2.Treatment, Lime.Treatment, Day) and their corresponding interactions.
dredge.models<-dredge(full.model.dredge,trace=FALSE,rank="AICc")
Then made a workspace variable for them
my.dredge.models<-get.models(dredge.models)
We then conducted a model average to average the coefficients for the best fit models
silly<-model.avg(my.dredge.models,subset=delta<10)
But now I want to create a graph with Total Alive on the Y axis, Days on the X axis, and a fitted line based on the output of the model. I understand this is tricky because the model combined Total.Alive and Total.Dead (see cbind(Total.Alive,Total.Dead) in the model).
When I try to run a predict command I get the error
# 9: In UseMethod("predict") :
# no applicable method for 'predict' applied to an object of class "mer"
Most of your problem is that you're using a pre-1.0 version of lme4, which doesn't have the predict method implemented. (Updating would be easiest, but I believe that if you can't for some reason, there's a recipe at http://glmm.wikidot.com/faq for doing the predictions by hand by extracting the fixed-effect design matrix and the coefficients ...)
There's actually not a problem with the predictions, which predict the log-odds (by default) or the probability (if type="response"); if you wanted to predict numbers, you'd have to multiply by N appropriately.
You didn't give one, but here's a reproducible (albeit somewhat trivial) example using the built-in cbpp data set (I do get some warning messages -- no non-missing arguments to max; returning -Inf -- but I think this may be due to the fact that there's only one non-trivial fixed-effect parameter in the model?)
library(lme4)
packageVersion("lme4") ## 1.1.4, but this should work as long as >1.0.0
library(MuMIn)
It's convenient for later use (with ggplot) to add a variable for the proportion:
cbpp <- transform(cbpp,prop=incidence/size)
Fit the model (you could also use glmer(prop~..., weights=size, ...))
gm0 <- glmer(cbind(incidence, size - incidence) ~ period+(1|herd),
family = binomial, data = cbpp)
dredge.models<-dredge(gm0,trace=FALSE,rank="AICc")
my.dredge.models<-get.models(dredge.models)
silly<-model.avg(my.dredge.models,subset=delta<10)
Prediction does work:
predict(silly,type="response")
Creating a plot:
library(ggplot2)
theme_set(theme_bw()) ## cosmetic
g0 <- ggplot(cbpp,aes(period,prop))+
geom_point(alpha=0.5,aes(size=size))
Set up a prediction frame:
predframe <- data.frame(period=levels(cbpp$period))
Predict at the population level (ReForm=NA; this may have to be REForm=NA in lme4 1.0.5, and recent lme4 releases spell the argument re.form=NA):
predframe$prop <- predict(gm0,newdata=predframe,type="response",ReForm=NA)
Add it to the graph:
g0 + geom_point(data=predframe,colour="red")+
geom_line(data=predframe,colour="red",aes(group=1))
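To turn predicted probabilities into predicted counts, as mentioned earlier ("multiply by N"), the probability is multiplied by the number of trials for each row. Here is a minimal sketch with base glm() on simulated data. The data frame `toy`, the fixed total of 20 per container, and the column names are assumptions for illustration, and the random effects of the original glmer model are left out.

```r
set.seed(3)
toy <- data.frame(Day = rep(0:9, times = 3), Total = 20)
toy$Total.Alive <- rbinom(nrow(toy), toy$Total, plogis(2 - 0.3 * toy$Day))
toy$Total.Dead  <- toy$Total - toy$Total.Alive

# Binomial model on (successes, failures), same response form as in the question
fit <- glm(cbind(Total.Alive, Total.Dead) ~ Day, family = binomial, data = toy)

# Prediction frame: one row per day, with the number of trials to scale by
newd <- data.frame(Day = 0:9, Total = 20)
newd$p_alive    <- predict(fit, newdata = newd, type = "response")  # probability
newd$pred_alive <- newd$p_alive * newd$Total                        # expected count
newd
```

The pred_alive column can then be plotted against Day on the same axes as the observed Total.Alive values.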
