ggforest error - undefined columns selected - r

I am trying to make a forest plot for my model with ggforest().
Here is the code to create mock data to reproduce the problem.
Data is formatted according to Therneau for time dependent covariates. I guess this might be the reason why ggforest does not operate properly.
library(survival)
library(survminer)
set.seed(1)
repetitions<-floor(sample(rnorm(1:10, 10)))
id<-rep(1:10, times=repetitions )
age<-rep(floor(sample(18:80,10)),times=repetitions)
diabetes<-rep(sample(0:1,10,replace=TRUE), times=repetitions)
bil<-sample(4:60,length(id), replace=TRUE)
status<-rep(1,length(id))
indices<-vector(length=10)
for(i in 1:10){
indices[i]<-sum(repetitions[1:i])
}
status[indices]<-2
daystart <- vector()
a<-vector()
for(i in 1:10){
if(i==1){ daystart<-1:indices[i]
} else {a<-1:(indices[i]-indices[i-1])
}
daystart<-c(daystart,a)
}
dayend<-daystart+1
mock_data<-cbind.data.frame(id,age,diabetes, bil, daystart, dayend, status)
mock_data$agegroup<-cut(mock_data$age, 2)
fit2<-coxph(Surv(daystart,dayend, status)~bil+diabetes+strata(agegroup), data=mock_data)
ggforest(fit2 , data=mock_data)
I get
Error in [.data.frame(data, ,var ) : undefined columns selected.
I tried installing a previous version of the broom package (0.5.6), as suggested in earlier threads, but it didn't resolve the issue. I used R versions 3.6.1 and 4.1.1. Any ideas?
EDIT: It turns out that ggforest() gets confused by +strata(). Removing +strata() produces a plot.

The problem was in this line of the ggforest() function:
terms <- attr(model$terms, "dataClasses")[-1]
As a quick fix, I copy-pasted the body of the function into a new function of my own and added the index "-4", so that the strata attributes are not added to terms.
I guess the original function could be changed to accommodate this and exclude strata from terms. I should stress, though, that I am not great at math or statistics, so I am not 100% sure whether stratifying is valid in a time-varying Cox proportional hazards analysis when the stratifying variable is continuous, such as age. Each stratum would then contain data for only the few individuals of the same age, each with repeated measurements.
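A hard-coded "-4" only works while the strata term happens to sit in fourth position. As a minimal sketch (not the survminer implementation), the strata entries could instead be dropped by name. The built-in lung data and the variable names below are stand-ins chosen for illustration, not the mock data from the question:

```r
library(survival)

# Stand-in model (assumption): the built-in `lung` data with a strata() term
fit <- coxph(Surv(time, status) ~ age + sex + strata(inst), data = lung)

# Same lookup that ggforest() performs, including the response in position 1
classes <- attr(fit$terms, "dataClasses")

# Drop the response, then drop every strata() term by name instead of by index
plot_terms <- classes[-1]
plot_terms <- plot_terms[!grepl("^strata\\(", names(plot_terms))]

print(names(plot_terms))
```

Filtering by name keeps working if covariates are added or reordered, which a fixed index does not.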

Related

Fama MacBeth regression pmg function error in R

I've been trying to run a Fama Macbeth regression using the pmg function for my data "Dev_Panel" but I keep getting this error message:
Error in pmg(BooktoMarket ~ Returns + Profitability + BEtoMEpersistence, :
Insufficient number of time periods
I've read in other posts on here that this could be due to NAs in the data. But I've already removed these from the panel.
Additionally, I've used the pmg function on the data frame "Em_Panel" for which I have undertaken the exact same data cleaning measures as for the "Dev_Panel". The regression for this panel worked, but it only produces a coefficient for the intercept. The other coefficients are NA.
Here's the code I used for the Em_Panel:
require(foreign)
require(plm)
require(lmtest)
Em_Panel <- read.csv2("Em_Panel.csv", na="NA")
FMR_Em <- pmg(BooktoMarket~Returns+Profitability+BEtoMEpersistence, Em_Panel, index = c("companyID", "years"))
And here's the code for the Dev_Panel:
Dev_Panel <- read.csv2("Dev_Panel.csv", na="NA")
FMR_Dev <- pmg(BooktoMarket~Returns+Profitability+BEtoMEpersistence, Dev_Panel, index = c("companyID", "years"))
Since this seemingly is a problem concerning my data I will gladly provide it:
http://www.filedropper.com/empanel
http://www.filedropper.com/devpanel
Thank you so much for any help!!!
Edit
After switching the arguments as suggested the error is now produced by the Dev_Panel and not the Em_Panel.
Also the regression for the Em_Panel now only provides a coefficient for the intercept. The other coefficients are NA.
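One thing that may be worth checking, sketched here on a small made-up panel (the data and the threshold are assumptions, not taken from the files above), is whether some group has fewer observations than the number of coefficients a per-group regression would need:

```r
# Hypothetical toy panel: company B is observed in only two years
panel <- data.frame(
  companyID = c(rep("A", 5), rep("B", 2), rep("C", 5)),
  years     = c(2001:2005, 2001:2002, 2001:2005),
  stringsAsFactors = FALSE
)

# Three regressors plus an intercept, as in the formula from the question
n_coefficients <- 3 + 1

# Count observations per company and flag groups that are too short
obs_per_company <- table(panel$companyID)
too_short <- names(obs_per_company)[obs_per_company < n_coefficients]

print(obs_per_company)
print(too_short)
```

The same count can be run on the years column to see how many companies are observed in each period, which matters once the two index arguments are switched.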

Predicting data from gamlss model in handler function using tryCatch in R

I am having a problem using the tryCatch() function in R in a function I created.
What I want to do is this:
simulate data based on model results
analyze simulated data using my gamlss model
use the predict function to extract model predictions over a new range of values
store these predictions in a data frame
do this many times
My main problem is that my model is somewhat unstable, and once in a while the simulated values are kind of wild, which in turn generates an error when I try to analyze them with gamlss. My objective is to write a tryCatch statement within my simulation function that simply runs the simulation/prediction code a second time if an error occurs. (I know this is not optimal. I could also write it recursively, or use repeat and run it until I don't get an error, but I get few enough errors that the probability of getting two in a row is quite low, and I'm having enough trouble with this task as it is.)
So I simplified my code as much as I could and created a dummy dataframe for which the modelling still works.
I wrote in the code where I believe the error is (with the predict function which does not find the mod_sim object). It is likely there since the cat just above this line prints while the one just below doesn't print.
I think there are some things about how tryCatch works that I don't understand well enough and I'm having a hard time to understand which objects are kept in which parts of functions and when they can be called or not...
Here is the code I have so far. The error occurs at l.84 (identified in the script). The data and code can be found here.
library(tidyverse)
library(gamlss)
library(gamlss.dist)
#Load data
load('DHT.RData')
#Run original model
mod_pred<-gamlss(harvest_total ~ ct,
data = DHT,
family = DPO)
#Function to compute predictions based on model
compute_CI_trad_gamlss<-function(n.sims=200, mod){#,
#DF for simulations
df_sims<-as.data.frame(DHT)
#Data frame with new data to predict over
new.data.ct<<-expand.grid(ct=seq(from=5, to=32, length.out=50))
#matrix to store predictions
preds.sim.trad.ct <<- matrix(NA, nrow=nrow(new.data.ct), ncol=n.sims)
#Number of obs to simulate
n<-nrow(df_sims)
#Simulation loop (simulate, analyze, predict, write result)
for(i in 1:n.sims){
#Put in tryCatch to deal with potential error on first run
tryCatch({
#Create matrix to store results of simulation
y<-matrix(NA,n,1)
#in DF for simulations, create empty row to be filled by simulated data
df_sims$sim_harvest<-NA
#Loop to simulate observations
for(t in 1:n){
#Simulate data based on model parameters
y[t]<-rDPO(n=1, mu=mod$mu.fv[t], sigma = mod$sigma.fv[t])
}#end of simulation loop
#Here I want the result of the simulation loop to be pasted in the df_sims dataset
df_sims$sim_harvest<-y
#Analysis of simulated data
mod_sim<-gamlss(sim_harvest ~ ct,
data = df_sims,
family = DPO)
#Refit the model if convergence not attained
if(mod_sim$converged==T){
#If converged do nothing
} else {
#If not converged refit model
mod_sim<-refit(mod_sim)
}
cat('we make it to here!\n')
#Store results in object
ct <<-as.vector(predict(mod_sim, newdata = new.data.ct, type='response'))
cat('but not to here :( \n')
#If we made it down here, register err as '0' to be used in the if statement in the 'finally' code
err<<-0
},
#If error register the error and write it!
error = function(e) {
#If error occurred, show it
cat('error at',i,'\n')
#Register err as 1 to be used in the if statement in the finally code below
err<<-1
},
finally = {
if(err==0){
#if no error, do nothing and keep going outside of tryCatch
}#End if err==0
else if (err==1){
#If error, re-simulate data and do the analysis again
y<-matrix(NA,n,1)
df_sims$sim_harvest<-NA
#Loop to simulate observations
for(t in 1:n){
#Simulate data based on the model results
y[t]<-rDPO(n=1, mu=mod$mu.fv[t], sigma = mod$sigma.fv[t])
}#end of simulation loop
#Here I want the result of the simulation loop to be pasted in the df_sims dataset
df_sims$sim_harvest<-y
#Analysis of simulated data
mod_sim<-gamlss(sim_harvest ~ ct,
data = df_sims,
family = DPO)
cat('we also make it here \n')
#Store results in object
ct <<-as.vector(predict(mod_sim, newdata = new.data.ct, type='response'))
cat('but not here... \n')
}#End if err==1,
}#End finally
)#End tryCatch
#Write predictions for this iteration to the DF and start over
preds.sim.trad.ct[,i] <<-ct
#Show iteration number
cat(i,'\n')
}
#Do some more stuff here
#Return results
return(preds = list(ct= list(predictions=preds.sim.trad.ct)))
}
#Run simulation and store object
result<-compute_CI_trad_gamlss(n.sims=20, mod=mod_pred)
Anyway I hope someone can help!
Thanks a lot!
So after a bit of trial and error I managed to make it work. I believe the problem lies in the mod_sim object that is not saved to the global environment. predict (or predict.gamlss here) is probably not looking in the function environment for the mod_sim object although I don't understand why it wouldn't. Anyway using <<- (i.e. assigning the object in the global environment from the function) for every object created in the function seemed to do the trick. If anyone has an explanation on why this happens though I'd be glad to understand what I'm doing wrong!
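On the scoping part of the question: tryCatch() returns the value of whichever branch ran, so the result can be assigned with an ordinary <- and no flag has to be pushed out with <<-. The following is a minimal sketch of a retry wrapper in base R. The names with_retry and flaky_step are hypothetical, and flaky_step only stands in for the simulate/fit/predict step; it is not gamlss code:

```r
# Hypothetical helper: run `step()` and retry on error, up to `max_tries` times
with_retry <- function(step, max_tries = 2) {
  for (attempt in seq_len(max_tries)) {
    # The value of tryCatch() is the value of step() or of the error handler
    result <- tryCatch(
      step(),
      error = function(e) {
        message("attempt ", attempt, " failed: ", conditionMessage(e))
        NULL  # signal failure to the loop below
      }
    )
    if (!is.null(result)) return(result)
  }
  stop("all ", max_tries, " attempts failed")
}

# Stand-in for the unstable step: fails on the first call only
calls <- 0
flaky_step <- function() {
  calls <<- calls + 1  # counter lives outside so the failure is deterministic
  if (calls == 1) stop("simulated fitting failure")
  c(1.5, 2.5)          # pretend these are the predictions
}

preds <- with_retry(flaky_step)
print(preds)
```

In the simulation loop, the body of step() would hold the simulation, the model fit and the predict call, and the returned vector would be written into the results matrix.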

How to correctly take out zero observations in panel data in R

I'm running into some problems while running plm regressions in my panel database. Basically, I have to take out a year from my base and also all observations from some variable that are zero. I tried to make a reproducible example using a dataset from AER package.
library(AER)
library(plm)
data("Grunfeld", package = "AER")
View(Grunfeld)
#Here I randomize some observations of the third variable (capital) as zero, to reproduce my dataset
for (i in 1:220) {
x <- rnorm(10,0,1)
if (mean(x) >=0) {
Grunfeld[i,3] <- 0
}
}
View(Grunfeld)
panel <- Grunfeld
#First Method
#This is how I was originally manipulating my data and running my regression
panel <- Grunfeld
dd <-pdata.frame(panel, index = c('firm', 'year'))
dd <- dd[dd$year!=1935, ]
dd <- dd[dd$capital !=0, ]
ols_model_2 <- plm(log(value) ~ (capital), data=dd)
summary(ols_model_2)
#However, I couldn't plot the variables of the dataset in graphs, because they weren't vectors. So I tried another way:
#Second Method
panel <- panel[panel$year!= 1935, ]
panel <- panel[panel$capital != 0,]
ols_model <- plm(log(value) ~ log(capital), data=panel, index = c('firm','year'))
summary(ols_model)
#But this gave extremely different results for the OLS regression!
In my understanding, both approaches should have yielded the same output in the OLS regression. Now I'm afraid my entire analysis is wrong, because I was doing it the first way. Could anyone explain to me what is happening?
Thanks in advance!
You are running two different models. I am not sure why you would expect the results to be the same.
Your first model is:
ols_model_2 <- plm(log(value) ~ (capital), data=dd)
While the second is:
ols_model <- plm(log(value) ~ log(capital), data=panel, index = c('firm','year'))
As you can see from the summaries of the models, both are "Oneway (individual) effect Within Model". In the first one you don't specify the index, since dd is a pdata.frame object. In the second you do specify the index, because panel is a plain data.frame. However, this makes no difference at all.
The difference is using the log of capital or capital without log.
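This can be checked directly. The sketch below assumes the Grunfeld data shipped with plm (not the AER copy used in the question) and leaves out the zero-filtering step: with the same formula, the two ways of preparing the data should give the same coefficient, and only the change of formula moves it.

```r
library(plm)
data("Grunfeld", package = "plm")  # columns: firm, year, inv, value, capital

# Route 1: convert to pdata.frame first, then drop 1935
dd <- pdata.frame(Grunfeld, index = c("firm", "year"))
dd <- dd[as.character(dd$year) != "1935", ]
fit_pdata <- plm(log(value) ~ log(capital), data = dd)

# Route 2: drop 1935 from the plain data.frame, pass the index to plm()
plain <- Grunfeld[Grunfeld$year != 1935, ]
fit_plain <- plm(log(value) ~ log(capital), data = plain,
                 index = c("firm", "year"))

# Same formula, different preparation: coefficients should agree
print(all.equal(coef(fit_pdata), coef(fit_plain)))

# Different formula (capital without log): the coefficient changes
fit_level <- plm(log(value) ~ capital, data = dd)
print(coef(fit_level))
```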
As a side note, leaving out 0 observations is often very problematic. If you do that, make sure you also try alternative ways of dealing with zero, and see how much your results change. You can get started here https://stats.stackexchange.com/questions/1444/how-should-i-transform-non-negative-data-including-zeros

'Invalid parent values' error when running JAGS from R

I am running a simple generalized linear model, calling JAGS from R. The model uses a negative binomial distribution. It is being fitted to data on counts of fish, with the majority of individual counts ('C' in the data set below) being zeros.
I initially ran the model with one covariate, temperature ('Temp'). About half of the time the model ran and the other half of the time the model gave me the error, 'Error in node C[###] Invalid parent values.' The value for C[###] in the error message changes with each successive attempt to run the model.
Since my success at running the model was inconsistent, I tried adding another covariate, salinity ('Salt'). Then the model would not run at all, with the same error message as above.
Any ideas or suggestions on the source of the error are greatly appreciated.
I suspect that the initial values for the dispersion parameter, r, may be the issue. Ideally I would add several more covariates to the model if this error can be addressed.
The data set and code are immediately below. For sake of getting the data to load properly on this website, I have omitted 662 of the 672 total values; even with the reduced data set (n = 10 instead of n = 672) the problem remains.
Thank you.
setwd("C:/Users/John/Desktop")
library('coda')
library('rjags')
library('R2jags')
set.seed(1000000000)
#data
n=10
C=c(0,0,0,0,0,1,0,0,0,1)
Temp=c(0,29.3,25.3,28.7,28.7,24.4,25.1,25.1,24.2,23.3)
Salt=c(6,6,0,6,6,0,12,12,6,12)
sink("My Model.txt")
cat("
model {
r~dunif(0,10)
beta0~dunif (-20,20)
beta1~dunif (-20,20)
beta2~dunif (-20,20)
for (i in 1:n) {
C[i] ~ dnegbin(p[i], r)
p[i] <- r/(r+lambda[i])
log(lambda[i]) <- mu[i]
mu[i] <- beta0 + beta1*Temp[i] + beta2*Salt[i]
}
}
", fill=TRUE)
sink()
n=n
C=C
Temp=Temp
Salt=Salt
#bundle data
bugs.data = list(
"n",
"C",
"Temp",
"Salt")
#parameters to monitor
params<-c(
"r",
"beta0",
"beta1",
"beta2")
#initial values
inits <- function(){list(
r=floor(runif(1,0,5)),
beta0=runif(1,-5,5),
beta1=runif(1,-5,5),
beta2=runif(1,-5,5))}
model.file <- 'My Model.txt'
jagsfit <- jags(data=bugs.data, inits=inits, params, n.iter=1000, n.thin=10, n.burnin=100, model.file)
print(jagsfit, digits=5)
This works fine for me most of the time, but it fails with the error you describe if the inits function samples a value of 0 for r, which you have made more likely by using floor() in the inits function (not sure why you did that: r is not restricted to integers, but it must be strictly positive). Also, every time you run the model you will get different initial values (unless you set a random seed in R), which makes your life more complicated than it needs to be. I generally recommend picking fixed (and probably overdispersed) initial values, such as r=0.01 and r=10 for the two chains in your example.
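The fixed starting values suggested above could be written as one list per chain. This is a sketch only: the beta values are illustrative assumptions, not tuned for this model, and the random variant simply drops floor() and keeps r away from zero.

```r
# Fixed, overdispersed starting values: one list per chain (two chains assumed)
fixed_inits <- list(
  list(r = 0.01, beta0 = -1, beta1 = 0, beta2 = 0),
  list(r = 10,   beta0 =  1, beta1 = 0, beta2 = 0)
)

# Random alternative: no floor(), and the lower bound keeps r strictly positive
random_inits <- function() {
  list(
    r     = runif(1, 0.01, 5),
    beta0 = runif(1, -5, 5),
    beta1 = runif(1, -5, 5),
    beta2 = runif(1, -5, 5)
  )
}

str(fixed_inits)
str(random_inits())
```

With R2jags, a list like fixed_inits would be passed as inits together with n.chains = 2, since the number of lists has to match the number of chains.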
However, JAGS picks usable initial values for this model as you can see by not providing your own inits e.g.:
library('runjags')
listdata <- lapply(bugs.data, get)
names(listdata) <- unlist(bugs.data)
run.jags(model.file, params, listdata)
I would also have a think about the prior you are using for r - it could well be that this will have a bigger effect on your posterior than intended. Another (not necessarily better) option is something like a gamma prior.
Matt

How to use a string as a formula in r

I'm trying to do an ANOVA of all of my data frame columns against time_of_day which is a factor. The rest of my columns are all doubles and of equal length.
x = 0
pdf("Time_of_Day.pdf")
for (i in names(data_in)){
if(x > 9){
test <- aov(paste(i, "~ time_of_day"), data = data_in)
}
x = x+1
}
dev.off()
Running this code gives me this error:
Error: $ operator is invalid for atomic vectors
Where is my code calling $? How can I fix this? Sorry, I'm new to R and am quite lost.
My research question is whether time of day has an effect on brain volume at different ROIs in the brain. Time of day is divided into three categories: morning, afternoon, or night.
Edit: SOLVED
Treating the string as a formula allows this to run, although I have been advised not to have this many independent values, as it will inflate the statistical results of the model. I am not removing this in case someone has a similar problem with the aov() call.
x = 0
pdf("Time_of_Day.pdf")
for (i in names(data_in)){
if(x > 9){
test <- aov(as.formula(paste(i, "~ time_of_day")), data = data_in)
}
x = x+1
}
dev.off()
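An alternative to as.formula(paste(...)) is base R's reformulate(), which builds the formula from column names directly. The sketch below uses a small made-up data frame (roi_a and roi_b are hypothetical column names) and selects the outcome columns by name rather than with a counter:

```r
# Made-up data: one factor and two numeric outcome columns
data_in <- data.frame(
  time_of_day = factor(rep(c("morning", "afternoon", "night"), each = 4)),
  roi_a = c(1.0, 1.2, 0.9, 1.1, 1.5, 1.4, 1.6, 1.7, 2.0, 2.1, 1.9, 2.2),
  roi_b = c(3.1, 2.9, 3.0, 3.2, 3.0, 3.1, 2.8, 3.3, 2.9, 3.0, 3.2, 3.1)
)

# Every column except the grouping factor is treated as an outcome
outcomes <- setdiff(names(data_in), "time_of_day")

# One ANOVA per outcome; reformulate() returns e.g. roi_a ~ time_of_day
fits <- lapply(outcomes, function(col) {
  aov(reformulate("time_of_day", response = col), data = data_in)
})
names(fits) <- outcomes

print(summary(fits$roi_a))
```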
I guess your problem is that you don't have an ANOVA formula integrated into your aov() function. See the following working example:
data_in <- data.frame(c(1,2,3),c(4,5,6),c(7,8,9))
names(data_in) <- c("first","second","third")
for (i in seq_along(names(data_in))){
test <- aov(data_in$first ~ data_in$second, data = data_in)
print(summary(test))
}
However, it seems that you tried to calculate an ANOVA for each column, whereas you need at least two variables: a nominal-scaled condition variable and an interval-scaled dependent variable (e.g. gender and weight). So I'm wondering whether an ANOVA is the correct method for your question. In any case, answering that would require sample data and a summary of your research question.

Resources