I am trying to use the pgmm function from the plm package for R. The regression runs and I can call up the results, however, asking for the summary gives the following error:
Error in t(y) %*% x : non-conformable arguments
I've imported the data from the World Bank using the WDI package:
library(plm) # load package
library(WDI) # Load package
COUNTRIES <- c("AGO","BEN","BWA","BFA","BDI") # Specify countries
INDICATORS <- c("NY.GDP.PCAP.KN", "SP.DYN.TFRT.IN", "SP.DYN.CBRT.IN", "SP.POP.TOTL") # Specify indicators
LONG <- WDI(country=COUNTRIES, indicator=INDICATORS, start=2005, end=2009, extra=FALSE) # Load data
PANEL <- pdata.frame(LONG, c("iso2c","year")) # Transform to PANEL dataframe
PANEL$year <- as.numeric(as.character(PANEL$year)) # Encode year
EQ <- pgmm( log(fertility) ~ log(gdp) + lag(log(fertility), 2) | lag(log(fertility), 2), data=PANEL, effect="twoways", model="twosteps", gmm.inst=~log(fertility) ) # Run regression
Calling the results as follows works.
EQ
But the summary (below) gives the error message mentioned above.
summary(EQ)
I think the error occurs because summary.pgmm tries to do a second order Arelland-Bond test of serial correlation on your data, but your data only have two points (2008 and 2009) so it fails.
To fix this problem, you could patch the function so that it checks whether you only have two points in the data set and runs the test only if you have more than two points. I provide a patched function below:
summary.pgmm.patched <- function (object, robust = FALSE, time.dummies = FALSE, ...)
{
model <- plm:::describe(object, "model")
effect <- plm:::describe(object, "effect")
transformation <- plm:::describe(object, "transformation")
if (robust) {
vv <- vcovHC(object)
}
else {
vv <- vcov(object)
}
if (model == "onestep")
K <- length(object$coefficients)
else K <- length(object$coefficients[[2]])
Kt <- length(object$args$namest)
if (!time.dummies && effect == "twoways")
rowsel <- -c((K - Kt + 1):K)
else rowsel <- 1:K
std.err <- sqrt(diag(vv))
b <- coef(object)
z <- b/std.err
p <- 2 * pnorm(abs(z), lower.tail = FALSE)
CoefTable <- cbind(b, std.err, z, p)
colnames(CoefTable) <- c("Estimate", "Std. Error", "z-value",
"Pr(>|z|)")
object$CoefTable <- CoefTable[rowsel, , drop = FALSE]
object$sargan <- sargan(object)
object$m1 <- plm:::mtest(object, 1, vv)
# The problem line:
# object$m2 <- mtest(object, 2, vv)
if (length(object$residuals[[1]] ) > 2) object$m2 <- plm:::mtest(object, 2, vv)
object$wald.coef <- plm:::wald(object, "param", vv)
if (plm:::describe(object, "effect") == "twoways")
object$wald.td <- plm:::wald(object, "time", vv)
class(object) <- "summary.pgmm"
object
}
You might want to write to the author of the plm package and show him this post. The author will be able to write a less 'hacky' patch.
Using your own (slightly modified) example data, here is how you would use the function:
library(WDI) # Load package
library(plm)
COUNTRIES <- c("AGO","BEN","BWA","BFA","BDI") # Specify countries
INDICATORS <- c("NY.GDP.PCAP.KN", "SP.DYN.TFRT.IN", "SP.DYN.CBRT.IN", "SP.POP.TOTL") # Specify indicators
LONG <- WDI(country=COUNTRIES, indicator=INDICATORS, start=2005, end=2009, extra=FALSE) # Load data
PANEL <- pdata.frame(LONG, c("iso2c","year")) # Transform to PANEL dataframe
PANEL$year <- as.numeric(as.character(PANEL$year)) # Encode year
names(PANEL) [c(4,5)] = c('gdp','fertility')
EQ <- pgmm( log(fertility) ~ log(gdp) + lag(log(fertility), 2) | lag(log(fertility), 2), data=PANEL, effect="twoways", model="twosteps", gmm.inst=~log(fertility) ) # Run regression
summary.pgmm.patched(EQ)
Related
I want to create a function which will perform panel regression with 3-level dummies included.
Let's consider within model with time effects :
library(plm)
fit_panel_lr <- function(y, x) {
x[, length(x) + 1] <- y
#adding dummies
mtx <- matrix(0, nrow = nrow(x), ncol = 3)
mtx[cbind(seq_len(nrow(mtx)), 1 + (as.integer(unlist(x[, 2])) - min(as.integer(unlist(x[, 2])))) %% 3)] <- 1
colnames(mtx) <- paste0("dummy_", 1:3)
#converting to pdataframe and adding dummy variables
x <- pdata.frame(x)
x <- cbind(x, mtx)
#performing panel regression
varnames <- names(x)[3:(length(x))]
varnames <- varnames[!(varnames == names(y))]
form <- paste0(varnames, collapse = "+")
x_copy <- data.frame(x)
form <- as.formula(paste0(names(y), "~", form,'-1'))
params <- list(
formula = form, data = x_copy, model = "within",
effect = "time"
)
pglm_env <- list2env(params, envir = new.env())
model_plm <- do.call("plm", params, envir = pglm_env)
model_plm
}
However, if I use data :
data("EmplUK", package="plm")
dep_var<-EmplUK['capital']
df1<-EmplUK[-6]
In output I will get :
> fit_panel_lr(dep_var, df1)
Model Formula: capital ~ sector + emp + wage + output + dummy_1 + dummy_2 +
dummy_3 - 1
<environment: 0x000001ff7d92a3c8>
Coefficients:
sector emp wage output
-0.055179 0.328922 0.102250 -0.002912
How come that in formula dummies are considered and in coefficients are not ? Is there any rational explanation or I did something wrong ?
One point why you do not see the dummies on the output is because they are linear dependent to the other data after the fixed-effect time transformation. They are dropped so what is estimable is estimated and output.
Find below some (not readily executable) code picking up your example from above:
dat <- cbind(EmplUK, mtx) # mtx being the dummy matrix constructed in your question's code for this data set
pdat <- pdata.frame(dat)
rhs <- paste(c("emp", "wage", "output", "dummy_1", "dummy_2", "dummy_3"), collapse = "+")
form <- paste("capital ~" , rhs)
form <- formula(form)
mod <- plm(form, data = pdat, model = "within", effect = "time")
detect.lindep(mod$model) # before FE time transformation (original data) -> nothing offending
detect.lindep(model.matrix(mod)) # after FE time transformation -> dummies are offending
The help page for detect.lindep (?detect.lindep is included in package plm) has some more nice examples on linear dependence before and after FE transformation.
A suggestion:
As for constructing dummy variables, I suggest to use R's factor with three levels and not have the dummy matrix constructed yourself. Using a factor is typically more convinient and less error prone. It is converted to the binary dummies (treatment style) by your typical estimation function using the model.frame/model.matrix framework.
I have an array of outputs from hundreds of segmented linear models (made using the segmented package in R). I want to be able to use these outputs on new data, using the predict function. To be clear, I do not have the segmented linear model objects in my workspace; I just saved and reimported the relevant outputs (e.g. the coefficients and breakpoints). For this reason I can't simply use the predict.segmented function from the segmented package.
Below is a toy example based on this link that seems promising, but does not match the output of the predict.segmented function.
library(segmented)
set.seed(12)
xx <- 1:100
zz <- runif(100)
yy <- 2 + 1.5*pmax(xx-35,0) - 1.5*pmax(xx-70,0) +
15*pmax(zz-0.5,0) + rnorm(100,0,2)
dati <- data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
o<-## S3 method for class 'lm':
segmented(out.lm,seg.Z=~x,psi=list(x=c(30,60)),
control=seg.control(display=FALSE))
# Note that coefficients with U in the name are differences in slopes, not slopes.
# Compare:
slope(o)
coef(o)[2] + coef(o)[3]
coef(o)[2] + coef(o)[3] + coef(o)[4]
# prediction
pred <- data.frame(x = 1:100)
pred$dummy1 <- pmax(pred$x - o$psi[1,2], 0)
pred$dummy2 <- pmax(pred$x - o$psi[2,2], 0)
pred$dummy3 <- I(pred$x > o$psi[1,2]) * (coef(o)[2] + coef(o)[3])
pred$dummy4 <- I(pred$x > o$psi[2,2]) * (coef(o)[2] + coef(o)[3] + coef(o)[4])
names(pred)[-1]<- names(model.frame(o))[-c(1,2)]
# compute the prediction, using standard predict function
# computing confidence intervals further
# suppose that the breakpoints are fixed
pred <- data.frame(pred, predict(o, newdata= pred,
interval="confidence"))
# Try prediction using the predict.segment version to compare
test <- predict.segmented(o)
plot(pred$fit, test, ylim = c(0, 100))
abline(0,1, col = "red")
# At least one segment not being predicted correctly?
Can I use the base r predict() function (not the segmented.predict() function) with the coefficients and break points saved from segmented linear models?
UPDATE
I figured out that the code above has issues (don't use it). Through some reverse engineering of the segmented.predict() function, I produced the design matrix and use that to predict values instead of directly using the predict() function. I do not consider this a full answer of the original question yet because predict() can also produce confidence intervals for the prediction, and I have not yet implemented that--question still open for someone to add confidence intervals.
library(segmented)
## Define function for making matrix of dummy variables (this is based on code from predict.segmented())
dummy.matrix <- function(x.values, x_names, psi.est = TRUE, nameU, nameV, diffSlope, est.psi) {
# This function creates a model matrix with dummy variables for a segmented lm with two breakpoints.
# Inputs:
# x.values: the x values of the segmented lm
# x_names: the name of the column of x values
# psi.est: this is legacy from the predict.segmented function, leave it set to 'TRUE'
# obj: the segmented lm object
# nameU: names (class character) of 3rd and 4th coef, which are "U1.x" "U2.x" for lm with two breaks. Example: names(c(obj$coef[3], obj$coef[4]))
# nameV: names (class character) of 5th and 6th coef, which are "psi1.x" "psi2.x" for lm with two breaks. Example: names(c(obj$coef[5], obj$coef[6]))
# diffSlope: the coefficients (class numeric) with the slope differences; called U1.x and U2.x for lm with two breaks. Example: c(o$coef[3], o$coef[4])
# est.psi: the estimated break points (class numeric); these are the estimated breakpoints from segmented.lm. Example: c(obj$psi[1,2], obj$psi[2,2])
#
n <- length(x.values)
k <- length(est.psi)
PSI <- matrix(rep(est.psi, rep(n, k)), ncol = k)
newZ <- matrix(x.values, nrow = n, ncol = k, byrow = FALSE)
dummy1 <- pmax(newZ - PSI, 0)
if (psi.est) {
V <- ifelse(newZ > PSI, -1, 0)
dummy2 <- if (k == 1)
V * diffSlope
else V %*% diag(diffSlope)
newd <- cbind(x.values, dummy1, dummy2)
colnames(newd) <- c(x_names, nameU, nameV)
} else {
newd <- cbind(x.values, dummy1)
colnames(newd) <- c(x_names, nameU)
}
# if (!x_names %in% names(coef(obj.seg)))
# newd <- newd[, -1, drop = FALSE]
return(newd)
}
## Test dummy matrix function----------------------------------------------
set.seed(12)
xx<-1:100
zz<-runif(100)
yy<-2+1.5*pmax(xx-35,0)-1.5*pmax(xx-70,0)+15*pmax(zz-.5,0)+rnorm(100,0,2)
dati<-data.frame(x=xx,y=yy,z=zz)
out.lm<-lm(y~x,data=dati)
#1 segmented variable, 2 breakpoints: you have to specify starting values (vector) for psi:
o<-segmented(out.lm,seg.Z=~x,psi=c(30,60),
control=seg.control(display=FALSE))
slope(o)
plot.segmented(o)
summary(o)
# Test dummy matrix fn with the same dataset
newdata <- dati
nameU1 <- c("U1.x", "U2.x")
nameV1 <- c("psi1.x", "psi2.x")
diffSlope1 <- c(o$coef[3], o$coef[4])
est.psi1 <- c(o$psi[1,2], o$psi[2,2])
test <- dummy.matrix(x.values = newdata$x, x_names = "x", psi.est = TRUE,
nameU = nameU1, nameV = nameV1, diffSlope = diffSlope1, est.psi = est.psi1)
# Predict response variable using matrix multiplication
col1 <- matrix(1, nrow = dim(test)[1])
test <- cbind(col1, test) # Now test is the same as model.matrix(o)
predY <- coef(o) %*% t(test)
plot(predY[1,])
lines(predict.segmented(o), col = "blue") # good, predict.segmented gives same answer
I am trying to use the lm.cluster function in the package miceadds to get robust clustered standard errors for a multiply imputed dataset.
I am able to get the standard version of it to run but I get the following error when I try to add a subset or weights:
Error in eval(substitute(subset), data, env) :
..1 used in an incorrect context, no ... to look in
Example that works without subset or weights:
require("mice")
require("miceadds")
data(data.ma01)
# imputation of the dataset: use six imputations
dat <- data.ma01[ , - c(1:2) ]
imp <- mice::mice( dat , maxit=3 , m=6 )
datlist <- miceadds::mids2datlist( imp )
# linear regression with cluster robust standard errors
mod <- lapply(datlist, FUN = function(data){miceadds::lm.cluster( data=data ,
formula=read ~ paredu+ female , cluster = data.ma01$idschool )} )
# extract parameters and covariance matrix
betas <- lapply( mod , FUN = function(rr){ coef(rr) } )
vars <- lapply( mod , FUN = function(rr){ vcov(rr) } )
# conduct statistical inference
summary(pool_mi( qhat = betas, u = vars ))
Example that breaks with subset:
mod <- lapply(datlist, FUN = function(data){miceadds::lm.cluster( data=data ,
formula=read ~ paredu+ female , cluster = data.ma01$idschool, subset=
(data.ma01$urban==1))} )
Error during wrapup: ..1 used in an incorrect context, no ... to look in
Example that breaks with weights:
mod <- lapply(datlist, FUN = function(data){miceadds::lm.cluster( data=data ,
formula=read ~ paredu+ female , cluster = data.ma01$idschool,
weights=data.ma01$studwgt)} )
Error during wrapup: ..1 used in an incorrect context, no ... to look in
From searching, I think I am encountering similar issues as others when passing these commands through an lm or glm wrapper (such as: Passing Argument to lm in R within Function or R : Pass argument to glm inside an R function or Passing the weights argument to a regression function inside an R function)
However, I am not sure how to address the issue with the imputed datasets & existing lm.cluster command.
Thanks
This works fine with the estimatr package which is on CRAN and the estimatr::lm_robust() function. Two notes: (1) you can change the type of standard errors using se_type = and (2) I keep idschool in the data because we like the clusters to be in the same data.frame as we fit the model on.
library(mice)
library(miceadds)
library(estimatr)
# imputation of the dataset: use six imputations
data(data.ma01)
dat <- data.ma01[, -c(1)] # note I keep idschool in data
imp <- mice::mice( dat , maxit = 3, m = 6)
datlist <- miceadds::mids2datlist(imp)
# linear regression with cluster robust standard errors
mod <- lapply(
datlist,
function (dat) {
estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool)
}
)
# subset
mod <- lapply(
datlist,
function (dat) {
estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool, subset = urban == 1)
}
)
# weights
mod <- lapply(
datlist,
function (dat) {
estimatr::lm_robust(read ~ paredu + female, dat, clusters = idschool, weights = studwgt)
}
)
# note that you can use the `se_type` argument of lm_robust()
# to change the vcov estimation
# extract parameters and covariance matrix
betas <- lapply(mod, coef)
vars <- lapply(mod, vcov)
# conduct statistical inference
summary(pool_mi( qhat = betas, u = vars ))
I'm no expert, but there is an issue with the passing of the weights to lm(). I know this is not an ideal situation, but I managed to get it to work by modifying the lm.cluster() function to hard code the weights pass and then just used my own.
lm.cluster <- function (data, formula, cluster, wgts=NULL, ...)
{
TAM::require_namespace_msg("multiwayvcov")
if(is.null(wgts)) {
mod <- stats::lm(data = data, formula = formula)
} else {
data$.weights <- wgts
mod <- stats::lm(data = data, formula = formula, weights=data$.weights)
}
if (length(cluster) > 1) {
v1 <- cluster
}
else {
v1 <- data[, cluster]
}
dfr <- data.frame(cluster = v1)
vcov2 <- multiwayvcov::cluster.vcov(model = mod, cluster = dfr)
res <- list(lm_res = mod, vcov = vcov2)
class(res) <- "lm.cluster"
return(res)
}
I built a model using ucm function. However when i try to forecast for future, it's not letting me pass the independent variables.
library("rucm")
library("lubridate")
#Create Simulated Data
Date<- as.Date(seq(from=as.Date('2012-01-01'),to=as.Date('2014-03-31'),by=1),"%Y-%m-%d")
Actual <- sample(27:65,length(Date),replace = TRUE)
DOW <- wday(Date)
Month <- month(Date)
DOM <- mday(Date)
Week <- week(Date)
Ya <- year(Date)
Ya <- ifelse(Ya=="2014",2,1)
a <- data.frame(Date,Actual,DOW,Month,DOM,Week,Ya)
a$Date<-as.Date(a$Date,"%Y-%m-%d")
abc <- cbind(Weekday=model.matrix(~as.factor(a$DOW)),
Mont=model.matrix(~as.factor(a$Month)),
Day=model.matrix(~as.factor(a$DOM)),a[,7,drop=FALSE],
Weekofyear=model.matrix(~as.factor(a$Week)))
abc<-data.frame(abc)
abc<-data.frame(abc[,c(-1,-8,-20,-52)])
abc2 <- subset(abc,abc$Ya==1)
abc2 <- abc2[,-48]
abc3 <- subset(abc,abc$Ya==2)
abc3 <- abc3[,-48]
#train and insample MAPE
a1<-subset(a,a$Ya==1)
a2<-subset(a,a$Ya==2)
#build model
dat <- as.data.frame(cbind(a1[,2,drop=FALSE], abc2))
fo <- as.formula(paste("Actual ~ ", paste(names(dat)[2:42], collapse= "+")))
fit_train_ucm <- ucm(fo, data = dat, cycle = TRUE, cycle.period = 365)
#predict for future
predict(fit_train_ucm,n.ahead = 90,newdata = abc3)
i am getting this error
Error in is.SSModel(newdata, na.check = TRUE, return.logical = FALSE) :
Object is not of class 'SSModel'
Update:
Based on suggestion by package author from GitHub, i used below code(sorry to say this is not very self explanatory, i did however try to edit code)
SSModel(rep(NA,nrow(abc3)) ~ x + SSMtrend(2, Q = list(fit_train_ucm$est.var.level, fit_train_ucm$est.var.slope)) + SSMseasonal(12, Q = fit_train_ucm$est.var.season), H = fit_train_ucm$irr.var, data=abc3)
Now the error message is
Error in eval(expr, envir, enclos) : object 'x' not found
Any help on this?
This bug of prediction with newdata in ucm is yet to corrected.
This is how you can get predictions for an out-of-sample period.
indep <- paste(names(dat)[2:42], collapse= "+")
newdata1 <- SSModel(as.formula(paste0("rep(NA,nrow(abc3)) ~ ", indep, "+ SSMtrend(1, Q = list(fit_train_ucm$est.var.level))",
"+ SSMcycle(365, Q = fit_train_ucm$est.var.cycle)")), H = fit_train_ucm$irr.var, data=abc3)
pred<-predict(fit_train_ucm$model, newdata=newdata1)
So basically here I am using the predict function from KFAS package. To use the predict function, I have to define the data to be an object of class SSModel.
You will take all the parameters that you used in the model as independent variables and run a State space model with NA's as independent variable.
P.S.: I will keep the comments in mind to write a better answer next time.
Just so someone lands at this page again in future - there has been some update on the git repo maintained by the author to approach this problem in simplistic way. Please check this commit
Alternatively, pasting the function which one may use from that commit:
predict.ucm <- function(object, n.ahead, newdata,...){
#### Predict in sample ####
if (missing(newdata)) {
return(predict(object = object$model, n.ahead = n.ahead))
}
#### Predict out of sample ####
# Regression variables.
model_variables <- paste0(names(object$est), collapse = " + ")
# Trend
# Case 1 no trend
if (is.null(object$est.var.level) & is.null(object$est.var.slope)) {
model_trend <- ""
}
# Case 2 level and trend
if (!is.null(object$est.var.level) & !is.null(object$est.var.slope)) {
model_trend <- "+ SSMtrend(degree = 2, Q = list(object$est.var.level, object$est.var.slope))"
}
# Case 3 level only / trend only is not allowed in R
if (!is.null(object$est.var.level) & is.null(object$est.var.slope)) {
model_trend <- "+ SSMtrend(degree = 1, Q = list(object$est.var.level))"
}
# Seasonality
if (!is.null(object$est.var.season)) {
model_season <- sprintf("+ SSMseasonal(period = %s, Q = object$est.var.season)",
object$call['season.length'] %>% as.character())
} else {
model_season <- ""
}
# Cycle
if (!is.null(object$est.var.cycle)) {
model_cycle <- sprintf("+ SSMcycle(period = %s, Q = object$est.var.cycle)",
object$call['cycle.period'] %>% as.character())
} else {
model_cycle <- ""
}
# Combine all components into a formula
model_formula <- as.formula(sprintf("rep(NA,nrow(newdata)) ~ %s %s %s %s",
model_variables,
model_trend,
model_season,
model_cycle
))
# Build a SSM object for the prediction
oos_data <- KFAS::SSModel(formula = model_formula, H = object$irr.var, data = newdata)
# Return the predictions
predict(object$model, newdata = oos_data)
}
EDIT
In case you are interested in predicting on a new data which is considered as for the same time period as the original model was built using, then following edit can be made in the above function. That new data should ideally include the same dependent variable values which was used to build the first model.
# Using same dependent which was used initially to fit the estimates, instead of NA
model_formula <- as.formula(paste0(
object$model$terms[[2]],
sprintf(
" ~ %s %s %s %s",
model_variables,
model_trend,
model_season,
model_cycle
)
))
# Build a SSM object for the prediction
oos_data <- KFAS::SSModel(formula = model_formula, H = object$irr.var, data = newdata)
# Return the predictions for in sample only - hence removing newdata arg
# predict(object$model, newdata = oos_data)
predict(oos_data)
I want use survfit() and basehaz() inside a function, but they do not work. Could you take a look at this problem. Thanks for your help. The following code leads to the error:
library(survival)
n <- 50 # total sample size
nclust <- 5 # number of clusters
clusters <- rep(1:nclust,each=n/nclust)
beta0 <- c(1,2)
set.seed(13)
#generate phmm data set
Z <- cbind(Z1=sample(0:1,n,replace=TRUE),
Z2=sample(0:1,n,replace=TRUE),
Z3=sample(0:1,n,replace=TRUE))
b <- cbind(rep(rnorm(nclust),each=n/nclust),rep(rnorm(nclust),each=n/nclust))
Wb <- matrix(0,n,2)
for( j in 1:2) Wb[,j] <- Z[,j]*b[,j]
Wb <- apply(Wb,1,sum)
T <- -log(runif(n,0,1))*exp(-Z[,c('Z1','Z2')]%*%beta0-Wb)
C <- runif(n,0,1)
time <- ifelse(T<C,T,C)
event <- ifelse(T<=C,1,0)
mean(event)
phmmd <- data.frame(Z)
phmmd$cluster <- clusters
phmmd$time <- time
phmmd$event <- event
fmla <- as.formula("Surv(time, event) ~ Z1 + Z2")
BaseFun <- function(x){
start.coxph <- coxph(x, phmmd)
print(start.coxph)
betahat <- start.coxph$coefficient
print(betahat)
print(333)
print(survfit(start.coxph))
m <- basehaz(start.coxph)
print(m)
}
BaseFun(fmla)
Error in formula.default(object, env = baseenv()) : invalid formula
But the following function works:
fit <- coxph(fmla, phmmd)
basehaz(fit)
It is a problem of scoping.
Notice that the environment of basehaz is:
environment(basehaz)
<environment: namespace:survival>
meanwhile:
environment(BaseFun)
<environment: R_GlobalEnv>
Therefore that is why the function basehaz cannot find the local variable inside the function.
A possible solution is to send x to the top using assign:
BaseFun <- function(x){
assign('x',x,pos=.GlobalEnv)
start.coxph <- coxph(x, phmmd)
print(start.coxph)
betahat <- start.coxph$coefficient
print(betahat)
print(333)
print(survfit(start.coxph))
m <- basehaz(start.coxph)
print(m)
rm(x)
}
BaseFun(fmla)
Other solutions may involved dealing with the environments more directly.
I'm following up on #moli's comment to #aatrujillob's answer. They were helpful so I thought I would explain how it solved things for me and a similar problem with the rpart and partykit packages.
Some toy data:
N <- 200
data <- data.frame(X = rnorm(N),W = rbinom(N,1,0.5))
data <- within( data, expr = {
trtprob <- 0.4 + 0.08*X + 0.2*W -0.05*X*W
Trt <- rbinom(N, 1, trtprob)
outprob <- 0.55 + 0.03*X -0.1*W - 0.3*Trt
Outcome <- rbinom(N,1,outprob)
rm(outprob, trtprob)
})
I want to split the data to training (train_data) and testing sets, and train the classification tree on train_data.
Here's the formula I want to use, and the issue with the following example. When I define this formula, the train_data object does not yet exist.
my_formula <- Trt~W+X
exists("train_data")
# [1] FALSE
exists("train_data", envir = environment(my_formula))
# [1] FALSE
Here's my function, which is similar to the original function. Again,
badFunc <- function(data, my_formula){
train_data <- data[1:100,]
ct_train <- rpart::rpart(
data= train_data,
formula = my_formula,
method = "class")
ct_party <- partykit::as.party(ct_train)
}
Trying to run this function throws an error similar to OP's.
library(rpart)
library(partykit)
bad_out <- badFunc(data=data, my_formula = my_formula)
# Error in is.data.frame(data) : object 'train_data' not found
# 10. is.data.frame(data)
# 9. model.frame.default(formula = Trt ~ W + X, data = train_data,
# na.action = function (x) {Terms <- attr(x, "terms") ...
# 8. stats::model.frame(formula = Trt ~ W + X, data = train_data,
# na.action = function (x) {Terms <- attr(x, "terms") ...
# 7. eval(expr, envir, enclos)
# 6. eval(mf, env)
# 5. model.frame.rpart(obj)
# 4. model.frame(obj)
# 3. as.party.rpart(ct_train)
# 2. partykit::as.party(ct_train)
# 1. badFunc(data = data, my_formula = my_formula)
print(bad_out)
# Error in print(bad_out) : object 'bad_out' not found
Luckily, rpart() is like coxph() in that you can specify the argument model=TRUE to solve these issues. Here it is again, with that extra argument.
goodFunc <- function(data, my_formula){
train_data <- data[1:100,]
ct_train <- rpart::rpart(
data= train_data,
## This solved it for me
model=TRUE,
##
formula = my_formula,
method = "class")
ct_party <- partykit::as.party(ct_train)
}
good_out <- goodFunc(data=data, my_formula = my_formula)
print(good_out)
# Model formula:
# Trt ~ W + X
#
# Fitted party:
# [1] root
# | [2] X >= 1.59791: 0.143 (n = 7, err = 0.9)
##### etc
documentation for model argument in rpart():
model:
if logical: keep a copy of the model frame in the result? If
the input value for model is a model frame (likely from an earlier
call to the rpart function), then this frame is used rather than
constructing new data.
Formulas can be tricky as they use lexical scoping and environments in a way that is not always natural (to me). Thank goodness Terry Therneau has made our lives easier with model=TRUE in these two packages!