Context
I would like to write a custom function that uses non-standard evaluation (NSE) to fit a Cox model. The model fits fine when I run the code directly.
But when I wrap the code into a custom function using NSE, it reports an error.
Question
I don't understand why using NSE in a custom function myfun reports an error, or how to fix it.
Reproducible code
library(survival)
library(rms)
data(cancer)
# works fine in normal code
dd = datadist(lung)
options(datadist = 'dd')
fit1 = cph(Surv(time, status) ~ rcs(meal.cal), data = lung)
fit1
# reports an error in a self-defined function using non-standard evaluation
myfun <- function(data, var) {
  dd = datadist(data)
  options(datadist = 'dd')
  fit1 = cph(Surv(time, status) ~ rcs({{var}}), data = data)
  fit1
}
Maybe you're looking for this?
myfun1 <- \(data, var) {
  # dd <- datadist(data)
  # options(datadist='dd') ## what is this doing? I see no difference
  fo <- as.formula(paste('Surv(time, status) ~ rcs(', var, ')'))
  do.call('cph', list(fo, quote(data)))
}
myfun2 <- \(data, var) {
  fo <- as.formula(paste('Surv(time, status) ~ rcs(', deparse(substitute(var)), ')'))
  do.call('cph', list(fo, quote(data)))
}
myfun1(lung, 'meal.cal') ## provide var as string
myfun2(lung, meal.cal) ## provide var as name
# Frequencies of Missing Values Due to Each Variable
# Surv(time, status) meal.cal
# 0 47
#
# Cox Proportional Hazards Model
#
# cph(formula = Surv(time, status) ~ rcs(meal.cal), data = data)
#
#
# Model Tests Discrimination
# Indexes
# Obs 181 LR chi2 0.72 R2 0.004
# Events 134 d.f. 4 R2(4,181)0.000
# Center -0.3714 Pr(> chi2) 0.9485 R2(4,134)0.000
# Score chi2 0.76 Dxy 0.048
# Pr(> chi2) 0.9443
#
# Coef S.E. Wald Z Pr(>|Z|)
# meal.cal -0.0006 0.0013 -0.48 0.6299
# meal.cal' 0.0007 0.0051 0.14 0.8860
# meal.cal'' 0.0010 0.0261 0.04 0.9682
# meal.cal''' -0.0132 0.0676 -0.19 0.8456
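For completeness, here is a sketch of a third variant that keeps a bare-name interface but uses rlang injection instead of pasting a formula string. Note that {{ }} is tidy-evaluation syntax and has no special meaning to cph(), which is why the original myfun fails; injecting the symbol into the formula before cph() sees it avoids that. The sketch assumes the datadist step is handled separately (it is only needed later for summary()/Predict(), not for fitting), and myfun3 is just an illustrative name.
library(rlang)
myfun3 <- function(data, var) {
  v <- ensym(var)  # capture the bare column name as a symbol
  inject(cph(Surv(time, status) ~ rcs(!!v), data = data))
}
myfun3(lung, meal.cal)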
Related
Context
I have a custom function myfun1 that fits a Cox model. Before fitting the model, I need to do a bit of processing on the data used to fit the model. Specifically, I run two lines of code: dd = datadist(data) and options(datadist = 'dd').
If dd exists only in the environment inside the function, myfun1 reports an error.
But when I assign dd to the global environment, myfun2 works fine.
Question
Why does this happen?
How can I get myfun1 to run properly while keeping dd inside the function?
Reproducible code
library(survival)
library(rms)
data(cancer)
myfun1 <- function(data, x){
  x = rlang::sym(x)
  dd = datadist(data)
  options(datadist = 'dd')
  fit = rlang::inject(cph(Surv(time, status) ~ rcs(!!x), data = data))
  fit
}
myfun1(data = lung, x = 'meal.cal')
# Error in Design(data, formula, specials = c("strat", "strata")) :
# dataset dd not found for options(datadist=)
myfun2 <- function(data, x){
  x = rlang::sym(x)
  dd <<- datadist(data)  # changed here compared to myfun1
  options(datadist = 'dd')
  fit = rlang::inject(cph(Surv(time, status) ~ rcs(!!x), data = data))
  fit
}
myfun2(data = lung, x = 'meal.cal')
# Frequencies of Missing Values Due to Each Variable
# Surv(time, status) meal.cal
# 0 47
#
# Cox Proportional Hazards Model
#
# cph(formula = Surv(time, status) ~ rcs(meal.cal), data = data)
#
#
# Model Tests Discrimination
# Indexes
# Obs 181 LR chi2 0.72 R2 0.004
# Events 134 d.f. 4 R2(4,181)0.000
# Center -0.3714 Pr(> chi2) 0.9485 R2(4,134)0.000
# Score chi2 0.76 Dxy 0.048
# Pr(> chi2) 0.9443
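One possible workaround, sketched under the assumption (suggested by the <<- version above) that rms looks the datadist object up by name in the global environment: assign a temporary copy there and remove it when the function exits, so nothing is left behind. The function name myfun3 and the temporary name ..dd.tmp are made up for illustration.
myfun3 <- function(data, x){
  x = rlang::sym(x)
  dd = datadist(data)
  # temporarily expose dd under a throwaway name where rms can find it
  assign("..dd.tmp", dd, envir = .GlobalEnv)
  old = options(datadist = "..dd.tmp")
  on.exit({
    options(old)                        # restore the previous datadist option
    rm("..dd.tmp", envir = .GlobalEnv)  # clean up the temporary object
  }, add = TRUE)
  rlang::inject(cph(Surv(time, status) ~ rcs(!!x), data = data))
}
myfun3(data = lung, x = 'meal.cal')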
I'm trying to create my first function in R, to run a moderation analysis.
But now a problem has occurred that I cannot solve :/
When I run my function, I don't get any output.
I also tried print() and return() with the same result.
Any recommendations?
Moderation <- function(Mod, UV, AV) {
  meanUV <- mean(UV, na.rm = TRUE)
  sdUV <- sd(UV, na.rm = TRUE)
  ZUV <- ((UV - meanUV)/sdUV)
  meanMod <- mean(Mod, na.rm = TRUE)
  sdMod <- sd(Mod, na.rm = TRUE)
  ZMod <- (Mod - meanMod)/sdMod
  Interaktion <- ZUV*ZAV
  Moderation.fit <- 'AV ~ ZUV + ZMod + Interaktion'
  summary(sem(model = Moderation.fit, data = MyData, meanstructure = TRUE))
}
Moderation(MyData$SKK, MyData$ZDT2, MyData$HO4)
Thank you for your help!
The problem in your function is that you refer to MyData in the last line even though you do not pass this dataset to the function. Actually, the function does look into your global environment and finds MyData there. But your z-scored variables and Interaktion are not included in this version of MyData, so the function reports an error about "missing variables in dataset".
It works with the following changes:
Moderation <- function(Mod, UV, AV) {
  require(lavaan)    # This ensures the function works even if you did not do library(lavaan) first.
  ZMod <- scale(Mod) # I simplified stuff here. The "scale" function is equivalent to what you did.
  ZUV <- scale(UV)
  Interaktion <- ZUV*ZMod
  # Here, MyData is generated so that you can use it further below
  MyData <- as.data.frame(list(ZMod = ZMod, ZUV = ZUV, AV = AV, Interaktion = Interaktion))
  Moderation.fit <- 'AV ~ ZUV + ZMod + Interaktion' # nothing changed here
  summary(sem(model = Moderation.fit, data = MyData, meanstructure = TRUE))
}
I tested it with the popular mtcars dataset:
data(mtcars)
Moderation(mtcars$am, mtcars$hp, mtcars$mpg)
##lavaan 0.6-7 ended normally after 28 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of free parameters 5
##
## Number of observations 32
##
##Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
##Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
##Regressions:
## Estimate Std.Err z-value P(>|z|)
## AV ~
## ZUV -4.043 0.560 -7.225 0.000
## ZMod 2.633 0.513 5.134 0.000
## Interaktion 0.014 0.527 0.026 0.979
##
##Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .AV 20.094 0.505 39.785 0.000
##
##Variances:
## Estimate Std.Err z-value P(>|z|)
## .AV 7.670 1.917 4.000 0.000
Good luck and have fun with it!
I'm following up on this great answer. The function foo below takes the Name column of the VarCorr(fit) output and makes it the column names for the summary(rePCA(fit)) call.
It works fine when we input fm1 or fm2, but I wonder why it fails for fm3. Is there a fix?
library(lme4)
dat <- read.csv('https://raw.githubusercontent.com/rnorouzian/e/master/sng.csv')
fm1 <- lmer(diameter ~ 1 + (1|plate) + (1|sample), Penicillin)
fm2 <- lmer(Reaction ~ Days + (Days | Subject), sleepstudy)
fm3 <- lmer(y ~ A * B * C + (A + B | group) + (C|group), data = dat)
foo <- function(fit) {
  obj <- summary(rePCA(fit))
  model <- VarCorr(fit)
  Map(function(x, z) {
    colnames(x$importance) <- paste(z, unique(sapply(model, colnames)), sep = '_')
    x
  }, obj, names(obj))
}
#EXAMPLE OF USE:
foo(fm1) ### OK !
foo(fm2) ### OK !
foo(fm3) ### :-( Error in dimnames(x) <- dn
The function fails when the lengths of obj and model differ. Here is a hack to make it work for fm3.
foo <- function(fit) {
  obj <- summary(rePCA(fit))
  model <- VarCorr(fit)
  if(length(obj) == length(model)) {
    obj <- Map(function(x, z) {
      colnames(x$importance) <- paste(z, unique(sapply(model, colnames)), sep = '_')
      x
    }, obj, names(obj))
  } else if(length(obj) == 1) {
    colnames(obj[[1]]$importance) <- unlist(mapply(paste, names(model), sapply(model, colnames),
                                                   MoreArgs = list(sep = '_')))
  }
  return(obj)
}
This returns the following output:
foo(fm1)
#$plate
#Importance of components:
# plate_(Intercept)
#Standard deviation 1.54
#Proportion of Variance 1.00
#Cumulative Proportion 1.00
#$sample
#Importance of components:
# sample_(Intercept)
#Standard deviation 3.51
#Proportion of Variance 1.00
#Cumulative Proportion 1.00
foo(fm2)
#$Subject
#Importance of components:
# Subject_(Intercept) Subject_Days
#Standard deviation 0.967 0.2309
#Proportion of Variance 0.946 0.0539
#Cumulative Proportion 0.946 1.0000
foo(fm3)
#$group
#Importance of components:
# group_(Intercept) group_A group_B group.1_(Intercept) group.1_C
#Standard deviation 1.418 1.291 0.5129 0.4542 0.0000497
#Proportion of Variance 0.485 0.402 0.0634 0.0498 0.0000000
#Cumulative Proportion 0.485 0.887 0.9502 1.0000 1.0000000
You could get the column names from fit@cnms, which saves you the trouble of using VarCorr. The tricky case is fm1, which gives a list output, so we coerce it with as.data.frame. Then we can just use `colnames<-` and no Map battles are needed.
foo2 <- function(fit) {
  obj <- summary(rePCA(fit))
  obj <- as.data.frame(lapply(obj, `[`, "importance"))
  `colnames<-`(obj, paste(names(fit@cnms), unlist(fit@cnms), sep = "_"))
}
foo2(fm1)
# plate_(Intercept) sample_(Intercept)
# Standard deviation 1.539676 3.512519
# Proportion of Variance 1.000000 1.000000
# Cumulative Proportion 1.000000 1.000000
foo2(fm2)
# Subject_(Intercept) Subject_Days
# Standard deviation 0.966868 0.2308798
# Proportion of Variance 0.946050 0.0539500
# Cumulative Proportion 0.946050 1.0000000
foo2(fm3)
# group_(Intercept) group_A group_B group_(Intercept) group_C
# Standard deviation 1.385987 1.322335 0.5128262 0.4547251 0.08892506
# Proportion of Variance 0.463190 0.421630 0.0634100 0.0498600 0.00191000
# Cumulative Proportion 0.463190 0.884820 0.9482300 0.9980900 1.00000000
Another advantage is that the numbers come out unrounded. You could build rounding into the function or do it afterwards:
round(foo2(fm1), 2)
# plate_(Intercept) sample_(Intercept)
# Standard deviation 1.54 3.51
# Proportion of Variance 1.00 1.00
# Cumulative Proportion 1.00 1.00
I have a balanced panel data set, df, that essentially consists of three variables, A, B and Y, that vary over time for a bunch of uniquely identified regions. I would like to run a regression that includes both regional (region in the code below) and time (year) fixed effects. If I'm not mistaken, I can achieve this in different ways:
lm(Y ~ A + B + factor(region) + factor(year), data = df)
or
library(plm)
plm(Y ~ A + B,
    data = df, index = c('region', 'year'), model = 'within',
    effect = 'twoways')
In the second call I specify the indices (region and year), the model type ('within', i.e. FE), and the nature of the FE ('twoways', meaning that I'm including both region and time FE).
Although I seem to be doing things correctly, I get extremely different results. The problem disappears when I drop the time fixed effects and use the argument effect = 'individual'.
What's the deal here? Am I missing something? Are there any other R packages that allow me to run the same analysis?
Perhaps posting an example of your data would help answer the question. I am getting the same coefficients for some made-up data. You can also use felm from the package lfe to do the same thing:
N <- 10000
df <- data.frame(a = rnorm(N), b = rnorm(N),
                 region = rep(1:100, each = 100), year = rep(1:100, 100))
df$y <- 2 * df$a - 1.5 * df$b + rnorm(N)
model.a <- lm(y ~ a + b + factor(year) + factor(region), data = df)
summary(model.a)
# (Intercept) -0.0522691 0.1422052 -0.368 0.7132
# a 1.9982165 0.0101501 196.866 <2e-16 ***
# b -1.4787359 0.0101666 -145.450 <2e-16 ***
library(plm)
pdf <- pdata.frame(df, index = c("region", "year"))
model.b <- plm(y ~ a + b, data = pdf, model = "within", effect = "twoways")
summary(model.b)
# Coefficients :
# Estimate Std. Error t-value Pr(>|t|)
# a 1.998217 0.010150 196.87 < 2.2e-16 ***
# b -1.478736 0.010167 -145.45 < 2.2e-16 ***
library(lfe)
model.c <- felm(y ~ a + b | factor(region) + factor(year), data = df)
summary(model.c)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# a 1.99822 0.01015 196.9 <2e-16 ***
# b -1.47874 0.01017 -145.4 <2e-16 ***
This does not seem to be a data issue.
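As one more option for the last part of the question (other packages for the same analysis), here is a sketch using the fixest package, assuming it is installed; feols() takes the same 'outcome ~ covariates | fixed effects' formula style as lfe::felm():
library(fixest)
model.d <- feols(y ~ a + b | region + year, data = df)
summary(model.d)
# the a and b estimates should match model.a, model.b and model.c above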
I'm doing computer exercises in R from Wooldridge (2012) Introductory Econometrics. Specifically Chapter 14 CE.1 (data is the rental file at: https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041)
I computed the model in differences (in Python)
model_diff = smf.ols(formula='diff_lrent ~ diff_lpop + diff_lavginc + diff_pctstu', data=rental).fit()
OLS Regression Results
==============================================================================
Dep. Variable: diff_lrent R-squared: 0.322
Model: OLS Adj. R-squared: 0.288
Method: Least Squares F-statistic: 9.510
Date: Sun, 05 Nov 2017 Prob (F-statistic): 3.14e-05
Time: 00:46:55 Log-Likelihood: 65.272
No. Observations: 64 AIC: -122.5
Df Residuals: 60 BIC: -113.9
Df Model: 3
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
Intercept 0.3855 0.037 10.469 0.000 0.312 0.459
diff_lpop 0.0722 0.088 0.818 0.417 -0.104 0.249
diff_lavginc 0.3100 0.066 4.663 0.000 0.177 0.443
diff_pctstu 0.0112 0.004 2.711 0.009 0.003 0.019
==============================================================================
Omnibus: 2.653 Durbin-Watson: 1.655
Prob(Omnibus): 0.265 Jarque-Bera (JB): 2.335
Skew: 0.467 Prob(JB): 0.311
Kurtosis: 2.934 Cond. No. 23.0
==============================================================================
Now, the plm package in R gives the same results for the first-difference model:
library(plm)
modelfd <- plm(lrent ~ lpop + lavginc + pctstu,
               data = data, model = "fd")
No problem so far. However, the fixed-effects model reports different estimates.
modelfx <- plm(lrent ~ lpop + lavginc + pctstu, data = data,
               model = "within", effect = "time")
summary(modelfx)
The FE results should not be any different. In fact, the Computer Exercise question is:
(iv) Estimate the model by fixed effects to verify that you get identical estimates and standard errors to those in part (iii).
My best guess is that I am misunderstanding something about the R package.
I have a list of formulas I want to fit to data. Rather than running a loop, I'd like to do this all at once, for performance's sake. The estimations should still be separate; I'm not trying to estimate a SUR or anything.
The following code does what I want:
x <- matrix(rnorm(300),ncol=3)
y <- x %*% c(1,2,3)+rnorm(100)
formulae <- list(y ~ x[,1],
                 y ~ x[,2],
                 y ~ x[,1] + x[,2])
lapply(formulae,lm)
Unfortunately this gets somewhat slow as the length of formulae increases. Is there a way to truly vectorize this?
If it is any help, the only results of lm I care about are the coefficients and some standard errors.
As I said in my comment, what you really need is a more efficient yet stable fitting routine other than lm(). Here I provide a well-tested one I wrote myself, called lm.chol(). It takes a formula and data, and returns:
a coefficient summary table, as you normally see in summary(lm(...))$coef;
the Pearson estimate of the residual standard error, as you get from summary(lm(...))$sigma;
the adjusted R-squared, as you get from summary(lm(...))$adj.r.squared.
## linear model estimation based on pivoted Cholesky factorization with Jacobi preconditioner
lm.chol <- function(formula, data) {
  ## stage 0: get response vector and model matrix
  ## we did not follow the normal route: match.call, model.frame, model.response, model matrix, etc
  y <- data[[as.character(formula[[2]])]]
  X <- model.matrix(formula, data)
  n <- nrow(X); p <- ncol(X)
  ## stage 1: XtX and Jacobi diagonal preconditioner
  XtX <- crossprod(X)
  D <- 1 / sqrt(diag(XtX))
  ## stage 2: pivoted Cholesky factorization
  R <- suppressWarnings(chol(t(D * t(D * XtX)), pivot = TRUE))
  piv <- attr(R, "pivot")
  r <- attr(R, "rank")
  if (r < p) {
    warning("Model is rank-deficient!")
    piv <- piv[1:r]
    R <- R[1:r, 1:r]
  }
  ## stage 3: solve linear system for coefficients
  D <- D[piv]
  b <- D * crossprod(X, y)[piv]
  z <- forwardsolve(t(R), b)
  RSS <- sum(y * y) - sum(z * z)
  sigma <- sqrt(RSS / (n - r))
  para <- D * backsolve(R, z)
  beta.hat <- rep(NA, p)
  beta.hat[piv] <- para
  ## stage 4: get standard error
  Rinv <- backsolve(R, diag(r))
  se <- rep(NA, p)
  se[piv] <- D * sqrt(rowSums(Rinv * Rinv)) * sigma
  ## stage 5: t-statistic and p-value
  t.statistic <- beta.hat / se
  p.value <- 2 * pt(-abs(t.statistic), df = n - r)
  ## stage 6: construct coefficient summary matrix
  coefficients <- matrix(c(beta.hat, se, t.statistic, p.value), ncol = 4L)
  colnames(coefficients) <- c("Estimate", "Std. Error", "t value", "Pr(>|t|)")
  rownames(coefficients) <- colnames(X)
  ## stage 7: compute adjusted R.squared
  adj.R2 <- 1 - sigma * sigma / var(y)
  ## return model fitting results
  attr(coefficients, "sigma") <- sigma
  attr(coefficients, "adj.R2") <- adj.R2
  coefficients
}
Here I offer three examples.
Example 1: full rank linear model
We take R's built-in dataset trees as an example.
# using `lm()`
summary(lm(Height ~ Girth + Volume, trees))
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 83.2958 9.0866 9.167 6.33e-10 ***
#Girth -1.8615 1.1567 -1.609 0.1188
#Volume 0.5756 0.2208 2.607 0.0145 *
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 5.056 on 28 degrees of freedom
#Multiple R-squared: 0.4123, Adjusted R-squared: 0.3703
#F-statistic: 9.82 on 2 and 28 DF, p-value: 0.0005868
## using `lm.chol()`
lm.chol(Height ~ Girth + Volume, trees)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 83.2957705 9.0865753 9.166905 6.333488e-10
#Girth -1.8615109 1.1566879 -1.609346 1.187591e-01
#Volume 0.5755946 0.2208225 2.606594 1.449097e-02
#attr(,"sigma")
#[1] 5.056318
#attr(,"adj.R2")
#[1] 0.3702869
The results are exactly the same!
Example 2: rank-deficient linear model
## toy data
set.seed(0)
dat <- data.frame(y = rnorm(100), x1 = runif(100), x2 = rbeta(100,3,5))
dat$x3 <- with(dat, (x1 + x2) / 2)
## using `lm()`
summary(lm(y ~ x1 + x2 + x3, dat))
#Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2164 0.2530 0.856 0.394
#x1 -0.1526 0.3252 -0.469 0.640
#x2 -0.3534 0.5707 -0.619 0.537
#x3 NA NA NA NA
#Residual standard error: 0.8886 on 97 degrees of freedom
#Multiple R-squared: 0.0069, Adjusted R-squared: -0.01358
#F-statistic: 0.337 on 2 and 97 DF, p-value: 0.7147
## using `lm.chol()`
lm.chol(y ~ x1 + x2 + x3, dat)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2164455 0.2529576 0.8556595 0.3942949
#x1 NA NA NA NA
#x2 -0.2007894 0.6866871 -0.2924030 0.7706030
#x3 -0.3051760 0.6504256 -0.4691944 0.6399836
#attr(,"sigma")
#[1] 0.8886214
#attr(,"adj.R2")
#[1] -0.01357594
#Warning message:
#In lm.chol(y ~ x1 + x2 + x3, dat) : Model is rank-deficient!
Here, lm.chol(), based on Cholesky factorization with complete pivoting, and lm(), based on QR factorization with partial pivoting, have set different coefficients to NA. But the two estimates are equivalent, with the same fitted values and residuals.
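A quick sketch to check that equivalence claim on the toy data above (nothing here is part of lm.chol() itself; it just compares fitted values from the two fits):
fit.qr <- lm(y ~ x1 + x2 + x3, dat)
b <- lm.chol(y ~ x1 + x2 + x3, dat)[, "Estimate"]
b[is.na(b)] <- 0  # the aliased column contributes nothing to the fit
X <- model.matrix(y ~ x1 + x2 + x3, dat)
all.equal(unname(fitted(fit.qr)), drop(X %*% b))  # should be TRUE up to numerical tolerance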
Example 3: performance for large linear models
n <- 10000; p <- 300
set.seed(0)
dat <- as.data.frame(setNames(replicate(p, rnorm(n), simplify = FALSE), paste0("x",1:p)))
dat$y <- rnorm(n)
## using `lm()`
system.time(lm(y ~ ., dat))
# user system elapsed
# 3.212 0.096 3.315
## using `lm.chol()`
system.time(lm.chol(y ~ ., dat))
# user system elapsed
# 1.024 0.028 1.056
lm.chol() is 3~4 times faster than lm(). If you want to know the reason, read this answer of mine.
Remark
I have focused on improving performance of the computational kernel. You can go one step further by using Ben Bolker's parallelism suggestion. If my approach gives a 3-fold boost, and parallel computing gives a 3-fold boost on 4 cores, you end up with a 9-fold boost!
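A minimal sketch of combining the two ideas, assuming the data live in a data frame df with columns y, x1 and x2 (rather than the anonymous matrix in the question) and a Unix-alike OS for mclapply(); on Windows you would use a PSOCK cluster with parLapply() instead:
library(parallel)
formulae.df <- list(y ~ x1, y ~ x2, y ~ x1 + x2)  # hypothetical formula list over df's columns
fits <- mclapply(formulae.df, function(f) lm.chol(f, df), mc.cores = 4)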
There's not really an easy way to vectorize this, but the pdredge function from the MuMIn package gives you a pretty easy way to parallelize it (this assumes you have multiple cores on your machine or that you can set up a local cluster in one of the ways supported by the parallel package).
library(parallel)
clust <- makeCluster(2,"PSOCK")
library(MuMIn)
Construct data:
set.seed(101)
x <- matrix(rnorm(300),ncol=3)
y <- x %*% c(1,2,3)+rnorm(100)
It will be easier to do this with a named data frame rather than an anonymous matrix:
df <- setNames(data.frame(y,x),c("y",paste0("x",1:3)))
The cluster nodes all need access to the data set:
clusterExport(clust,"df")
Fit the full model (you could use y~. to fit all variables)
full <- lm(y~x1+x2,data=df,na.action=na.fail)
Now fit all submodels (see ?MuMIn::dredge for many more options to control which submodels are fitted)
p <- pdredge(full,cluster=clust)
coef(p)
## (Intercept) x1 x2
## 3 -0.003805107 0.7488708 2.590204
## 2 -0.028502039 NA 2.665305
## 1 -0.101434662 1.0490816 NA
## 0 -0.140451160 NA NA