Is there a function to extract Y from an lm object?
I use residuals(m) and predict(m), but I'm reaching into the object's internal structure to extract Y...
m = lm(Y ~ X1, d)
head(m$model$Y)
[1] -0.791214 -1.291986 -0.472839 1.940940 -0.977910 -1.705539
You could use model.frame(), like the following:
# From the stats::lm documentation
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm1 <- lm(weight ~ group)
model.frame(lm1)$weight
## [1] 4.17 5.58 5.18 6.11 4.50 4.61 5.17 4.53 5.33 5.14 4.81 4.17 4.41
## [14] 3.59 5.87 3.83 6.03 4.89 4.32 4.69
If you call a function on one or more of the variables in your formula, like
lm2 <- lm(log(weight) ~ group)
You can get the untransformed values with get_all_vars(lm2)$weight (model.frame() returns the transformed values).
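For example (reusing lm1 and lm2 from above):
model.frame(lm2)$`log(weight)`  # the transformed response stored in the model frame
get_all_vars(lm2)$weight        # the original, untransformed values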
If you want to see what functions (particularly extractor functions) are available for a particular class, you can check using methods(class = "lm") (or whatever object class you're interested in).
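For instance, methods(class = "lm") lists extractors such as fitted(), residuals(), and nobs():
fitted(lm1)     # fitted values
residuals(lm1)  # residuals
nobs(lm1)       # number of observations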
I have imputed data saved as a mids object and am trying to adapt my usual workflow to imputed data. However, I cannot figure out how to use sjPlot's tab_corr() and tab_df(), or psych's describe(), on a mids object.
My goal is to generate a table of descriptive statistics and a correlation matrix without averaging the imputed datasets together. I was able to generate correlations using miceadds::micombine.cor, but the output isn't formatted like a typical correlation matrix. I can also individually compute means, SDs, etc. of variables from the mids object, but I'm looking for something that will generate a table.
library(mice)
library(miceadds)
library(sjPlot)
library(tidyverse)
library(psych)
set.seed(123)
## correlation matrix
data(nhanes)
imp <- mice(nhanes, print = FALSE)
head(micombine.cor(mi.res = imp)) # ugly
#> variable1 variable2 r rse fisher_r fisher_rse fmi
#> 1 age bmi -0.38765907 0.1899398 -0.40904214 0.2234456 0.09322905
#> 2 age hyp 0.51588273 0.1792162 0.57071301 0.2443348 0.25939786
#> 3 age chl 0.37685482 0.2157535 0.39638877 0.2515615 0.30863126
#> 4 bmi hyp -0.01748158 0.2244419 -0.01748336 0.2245067 0.10249784
#> 5 bmi chl 0.29082393 0.2519295 0.29946608 0.2752862 0.44307791
#> 6 hyp chl 0.30271060 0.1984525 0.31250096 0.2185381 0.04935528
#> t p lower95 upper95
#> 1 -1.83061192 0.06715849 -0.68949235 0.0288951
#> 2 2.33578315 0.01950255 0.09156846 0.7816509
#> 3 1.57571320 0.11509191 -0.09636276 0.7111171
#> 4 -0.07787455 0.93792784 -0.42805131 0.3990695
#> 5 1.08783556 0.27666771 -0.23557593 0.6852881
#> 6 1.42996130 0.15272813 -0.11531056 0.6296450
data(iris)
iris %>%
select(-c(Species)) %>%
tab_corr() # pretty
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length -0.118 0.872*** 0.818***
Sepal.Width -0.118 -0.428*** -0.366***
Petal.Length 0.872*** -0.428*** 0.963***
Petal.Width 0.818*** -0.366*** 0.963***
Computed correlation used pearson-method with listwise-deletion.
## descriptive statistics
psych::describe(imp) # error
#> Warning in mean.default(x, na.rm = na.rm): argument is not numeric or logical:
#> returning NA
#> Error in is.data.frame(x): 'list' object cannot be coerced to type 'double'
mean(imp$data$age) # inefficient
#> [1] 1.76
iris %>%
select(-c(Species)) %>%
psych::describe() %>%
select(-(c(vars, n, median, trimmed, mad))) %>%
tab_df() # pretty
mean sd min max range skew kurtosis se
5.84 0.83 4.30 7.90 3.60 0.31 -0.61 0.07
3.06 0.44 2.00 4.40 2.40 0.31 0.14 0.04
3.76 1.77 1.00 6.90 5.90 -0.27 -1.42 0.14
1.20 0.76 0.10 2.50 2.40 -0.10 -1.36 0.06
Created on 2021-12-11 by the reprex package (v2.0.1)
The previous code was incorrect. I have created two functions, mice_df and mice_cor (link to the GitHub repo here), that generate a table of descriptive statistics and a correlation matrix from a mids object using Rubin's rules.
gtsummary will neatly format models based on mids objects.
library(mice)
library(gtsummary)
library(tablecloth)
library(dplyr)
data(nhanes)
imp <- mice(nhanes, m = 3, print = FALSE)
mod <- with(imp, lm(age ~ bmi + chl))
tbl_regression(as.mira(mod)) %>% as_kable()
vs <- c("bmi", "chl", "age", "hyp")
title <- "Table 1: Correlation matrix"
mice_cor(imp = imp,
         vs = vs,
         title = title)
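For intuition, the idea behind mice_df is roughly the following sketch (my own illustration here, not the packaged function): compute the descriptives within each completed dataset and average the point estimates across imputations, which is Rubin's rule for pooling:
library(psych)
desc_by_imp <- lapply(seq_len(imp$m), function(i) as.matrix(psych::describe(complete(imp, i))))
desc_pooled <- Reduce(`+`, desc_by_imp) / imp$m  # average each statistic across the m imputations
round(desc_pooled, 2)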
I see my error now: I was using imp when I should have used imp$data. This works for tab_df and tab_corr, but not tab_model.
library(tidyverse)
library(sjPlot)
library(mice)
library(psych)
set.seed(123)
# Imputed data
data(nhanes)
imp <- mice(nhanes, print = FALSE)
## tab_df
imp$data %>%
select(age, bmi, chl) %>%
psych::describe() %>%
tab_df(.)
## tab_corr
imp$data %>%
select(age, bmi, chl) %>%
tab_corr(.)
## tab_model
mod <- with(imp, lm(age ~ bmi)) %>% pool()
summary(mod)
tab_model(mod) # error...
#> Error in fam.info$is_linear || identical(fam.info$link_function, "identity"): invalid 'x' type in 'x || y'
These are three different ways to run an individual fixed effects model, and they give more or less the same results (see below). My main question is how to get predicted probabilities or average marginal effects using the second model (model_plm) or the third model (model_felm). I know how to do it using the first model (model_lm) and show an example below using ggeffects, but that only works when I have a small sample.
As I have over a million individuals, my model only works with model_plm and model_felm. If I use model_lm, it takes a very long time to run with one million individuals, since they are controlled for in the model, and I also get the following error: Error: vector memory exhausted (limit reached?). I checked many threads on Stack Overflow to work around that error, but nothing seems to solve it.
I was wondering whether there is an efficient way to work around this issue. My main interest is to extract the predicted probabilities of the interaction residence*union. I usually extract predicted probabilities or average marginal effects using one of these packages: ggeffects, emmeans, or margins.
library(lfe)
library(plm)
library(ggeffects)
data("Males")
model_lm = lm(wage ~ exper + residence + health + residence*union + factor(nr) - 1, data = Males)
model_plm = plm(wage ~ exper + residence + health + residence*union, model = "within", index = c("nr", "year"), data = Males)
model_felm = felm(wage ~ exper + residence + health + residence*union | nr, data = Males)
pred_ggeffects <- ggpredict(model_lm, c("residence", "union"),
                            vcov.fun = "vcovCL",
                            vcov.type = "HC1",
                            vcov.args = list(cluster = Males$nr))
I tried adjusting the formula/datasets to get emmeans and plm to play nicely together. Let me know if there's something here. After some testing, I realized the biglm answer wasn't going to cut it for a million individuals.
library(emmeans)
library(plm)
data("Males")
## this runs but we need to get an equivalent result with expanded formula
## and expanded dataset
model_plm = plm(wage ~ exper + residence + health + residence*union, model = "within", index = c("nr"), data = Males)
## expanded dataset
Males2 <- data.frame(wage = Males[complete.cases(Males), "wage"],
                     model.matrix(wage ~ exper + residence + health + residence*union, Males),
                     nr = Males[complete.cases(Males), "nr"])
(fmla2 <- as.formula(paste("wage ~ ", paste(names(coef(model_plm)), collapse = "+"))))
## expanded formula
model_plm2 <- plm(fmla2,
                  model = "within",
                  index = c("nr"),
                  data = Males2)
(fmla2_rg <- as.formula(paste("wage ~ -1 +", paste(names(coef(model_plm)), collapse = "+"))))
plm2_rg <- qdrg(fmla2_rg,
                data = Males2,
                coef = coef(model_plm2),
                vcov = vcov(model_plm2),
                df = model_plm2$df.residual)
plm2_rg
### when all 3 residences are 0, that's `rural area`
### then just pick the rows when one of the residences are 1
emmeans(plm2_rg, c("residencenorth_east","residencenothern_central","residencesouth", "unionyes"))
Which gives, after some row-deletion:
> emmeans(plm2_rg, c("residencenorth_east","residencenothern_central","residencesouth", "unionyes"))
residencenorth_east residencenothern_central residencesouth unionyes emmean SE df lower.CL upper.CL
0 0 0 0 0.3777 0.0335 2677 0.31201 0.443
1 0 0 0 0.3301 0.1636 2677 0.00929 0.651
0 1 0 0 0.1924 0.1483 2677 -0.09834 0.483
0 0 1 0 0.2596 0.1514 2677 -0.03732 0.557
0 0 0 1 0.2875 0.1473 2677 -0.00144 0.576
1 0 0 1 0.3845 0.1647 2677 0.06155 0.708
0 1 0 1 0.3326 0.1539 2677 0.03091 0.634
0 0 1 1 0.3411 0.1534 2677 0.04024 0.642
Results are averaged over the levels of: healthyes
Confidence level used: 0.95
The problem seems to be that when we add -1 to the formula, that creates an extra column in the model matrix that is not included in the regression coefficients. (This is a byproduct of the way that R creates factor codings.)
So I can work around this by adding a strategically placed coefficient of zero. We also have to fix up the covariance matrix the same way:
library(emmeans)
library(plm)
data("Males")
mod <- plm(wage ~ exper + residence + health + residence*union,
           model = "within",
           index = "nr",
           data = Males)
BB <- c(coef(mod)[1], 0, coef(mod)[-1])  # insert a zero at the position of the extra column
k <- length(BB)
VV <- matrix(0, nrow = k, ncol = k)
VV[c(1, 3:k), c(1, 3:k)] <- vcov(mod)    # pad the covariance matrix to match
RG <- qdrg(~ -1 + exper + residence + health + residence*union,
           data = Males, coef = BB, vcov = VV, df = df.residual(mod))
Verify that things line up:
> names(RG@bhat)
[1] "exper" ""
[3] "residencenorth_east" "residencenothern_central"
[5] "residencesouth" "healthyes"
[7] "unionyes" "residencenorth_east:unionyes"
[9] "residencenothern_central:unionyes" "residencesouth:unionyes"
> colnames(RG@linfct)
[1] "exper" "residencerural_area"
[3] "residencenorth_east" "residencenothern_central"
[5] "residencesouth" "healthyes"
[7] "unionyes" "residencenorth_east:unionyes"
[9] "residencenothern_central:unionyes" "residencesouth:unionyes"
They do line up, so we can get the results we need:
(EMM <- emmeans(RG, ~ residence * union))
residence union emmean SE df lower.CL upper.CL
rural_area no 0.378 0.0335 2677 0.31201 0.443
north_east no 0.330 0.1636 2677 0.00929 0.651
nothern_central no 0.192 0.1483 2677 -0.09834 0.483
south no 0.260 0.1514 2677 -0.03732 0.557
rural_area yes 0.287 0.1473 2677 -0.00144 0.576
north_east yes 0.385 0.1647 2677 0.06155 0.708
nothern_central yes 0.333 0.1539 2677 0.03091 0.634
south yes 0.341 0.1534 2677 0.04024 0.642
Results are averaged over the levels of: health
Confidence level used: 0.95
In general, the key is to identify where the added column occurs. It's going to be at the position of the first level of the first factor in the model formula. You can check by comparing names(coef(mod)) with colnames(model.matrix(formula, data = data)), where formula is the model formula with the intercept removed.
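For example, a quick check using mod and Males from above (the variable form here is just for illustration):
form <- wage ~ -1 + exper + residence + health + residence*union
setdiff(colnames(model.matrix(form, data = Males)), names(coef(mod)))
#> should single out the extra column, "residencerural_area"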
Update: a general function
Here's a function that may be used to create a reference grid for any plm object. It turns out that sometimes these objects do have an intercept (e.g., random-effects models) so we have to check. For models lacking an intercept, you really should use this only for contrasts.
plmrg = function(object, ...) {
    # recover the model formula; drop the intercept if the plm fit has none
    form = formula(formula(object))
    if (!("(Intercept)" %in% names(coef(object))))
        form = update(form, ~ . - 1)
    data = eval(object$call$data, environment(form))
    mmat = model.matrix(form, data)
    # pad the coefficients and covariance matrix with zeros for any
    # model-matrix columns (e.g. absorbed factor levels) not in coef(object)
    sel = which(colnames(mmat) %in% names(coef(object)))
    k = ncol(mmat)
    b = rep(0, k)
    b[sel] = coef(object)
    v = matrix(0, nrow = k, ncol = k)
    v[sel, sel] = vcov(object)
    emmeans::qdrg(formula = form, data = data,
                  coef = b, vcov = v, df = df.residual(object), ...)
}
Test run:
> (rg = plmrg(mod, at = list(exper = c(3,6,9))))
'emmGrid' object with variables:
exper = 3, 6, 9
residence = rural_area, north_east, nothern_central, south
health = no, yes
union = no, yes
> emmeans(rg, "residence")
NOTE: Results may be misleading due to involvement in interactions
residence emmean SE df lower.CL upper.CL
rural_area 0.313 0.0791 2677 0.1579 0.468
north_east 0.338 0.1625 2677 0.0190 0.656
nothern_central 0.243 0.1494 2677 -0.0501 0.536
south 0.281 0.1514 2677 -0.0161 0.578
Results are averaged over the levels of: exper, health, union
Confidence level used: 0.95
This potential solution uses biglm::biglm() to fit the lm model and then uses emmeans::qdrg() with a nuisance factor specified. Does this approach help in your situation?
library(biglm)
library(emmeans)
## the biglm coefficients using factor() with all the `nr` levels have NAs,
## so restrict the data to complete cases in the biglm() call
model_biglm <- biglm(wage ~ -1 + exper + residence + health + residence*union + factor(nr), data = Males[!is.na(Males$residence), ])
summary(model_biglm)
## double check that biglm and lm give same/similar model
## summary(model_biglm)
## summary(model_lm)
summary(model_biglm)$rsq
summary(model_lm)$r.squared
identical(coef(model_biglm), coef(model_lm)) ## not identical! but plot the coefficients...
head(cbind(coef(model_biglm), coef(model_lm)))
tail(cbind(coef(model_biglm), coef(model_lm)))
plot(cbind(coef(model_biglm), coef(model_lm))); abline(0,1,col="blue")
## do a [q]uick and [d]irty [r]eference [g]rid and follow examples
### from ?qdrg and https://cran.r-project.org/web/packages/emmeans/vignettes/FAQs.html
rg1 <- qdrg(wage ~ -1 + exper + residence + health + residence*union + factor(nr),
            data = Males,
            coef = coef(model_biglm),
            vcov = vcov(model_biglm),
            df = model_biglm$df.resid,
            nuisance = "nr")
## Since we already specified nuisance in qdrg() we don't in emmeans():
emmeans(rg1, c("residence","union"))
Which gives:
> emmeans(rg1, c("residence","union"))
residence union emmean SE df lower.CL upper.CL
rural_area no 1.72 0.1417 2677 1.44 2.00
north_east no 1.67 0.0616 2677 1.55 1.79
nothern_central no 1.53 0.0397 2677 1.45 1.61
south no 1.60 0.0386 2677 1.52 1.68
rural_area yes 1.63 0.2011 2677 1.23 2.02
north_east yes 1.72 0.0651 2677 1.60 1.85
nothern_central yes 1.67 0.0503 2677 1.57 1.77
south yes 1.68 0.0460 2677 1.59 1.77
Results are averaged over the levels of: 1 nuisance factors, health
Confidence level used: 0.95
I'm using an lm()-like function called robu() from the robumeta package inside my own function foo.
I'm manipulating the formula argument so that when it is missing, the default formula is formula(dint ~ 1); otherwise it is whatever formula the user supplies.
This works fine; however, in the printed output of foo the formula is always shown as Model: if (missing(f)) formula(dint ~ 1) else formula(f), regardless of what formula is passed to foo.
Can I correct this part of the output so that it shows only the exact formula used? (see the examples below)
dat <- data.frame(dint = 1:9, SD = 1:9*.1,
                  time = c(1,1,2,3,4,3,2,4,1),
                  study.name = rep(c("bob", "jim", "jon"), 3))
library(robumeta)
# MY FUNCTION:
foo <- function(f, data){
  robu(formula = if(missing(f)) formula(dint~1) else formula(f), data = data, studynum = study.name, var = SD^2)
}
# EXAMPLES OF USE:
foo(data = dat) ## HERE I expect: `Model: dint ~ 1`
foo(dint~as.factor(time), data = dat) ## HERE I expect: `Model: dint ~ as.factor(time)`
One option is to update the ml element of the fitted model object:
foo <- function(f, data){
  fmla <- if(missing(f)) {
    formula(dint ~ 1)
  } else {
    formula(f)
  }
  model <- robu(formula = fmla, data = data, studynum = study.name, var = SD^2)
  model$ml <- fmla  # overwrite the stored formula so print() shows the one actually used
  model
}
Checking:
foo(data = dat)
RVE: Correlated Effects Model with Small-Sample Corrections
Model: dint ~ 1
Number of studies = 3
Number of outcomes = 9 (min = 3 , mean = 3 , median = 3 , max = 3 )
Rho = 0.8
I.sq = 96.83379
Tau.sq = 9.985899
Estimate StdErr t-value dfs P(|t|>) 95% CI.L 95% CI.U Sig
1 X.Intercept. 4.99 0.577 8.65 2 0.0131 2.51 7.48 **
---
Signif. codes: < .01 *** < .05 ** < .10 *
---
Note: If df < 4, do not trust the results
foo(dint~ as.factor(time), data = dat)
RVE: Correlated Effects Model with Small-Sample Corrections
Model: dint ~ as.factor(time)
Number of studies = 3
Number of outcomes = 9 (min = 3 , mean = 3 , median = 3 , max = 3 )
Rho = 0.8
I.sq = 97.24601
Tau.sq = 11.60119
Estimate StdErr t-value dfs P(|t|>) 95% CI.L 95% CI.U Sig
1 X.Intercept. 3.98 2.50 1.588 2.00 0.253 -6.80 14.8
2 as.factor.time.2 1.04 4.41 0.236 1.47 0.842 -26.27 28.3
3 as.factor.time.3 1.01 1.64 0.620 1.47 0.617 -9.10 11.1
4 as.factor.time.4 2.52 2.50 1.007 2.00 0.420 -8.26 13.3
---
Signif. codes: < .01 *** < .05 ** < .10 *
I am getting the error above when trying to use the cv.lm function. Please see my code:
sample<-read.csv("UU2_1_lung_cancer.csv",header=TRUE,sep=",",na.strings="NA")
sample1<-sample[2:2000,3:131]
samplex<-sample[2:50,3:131]
y<-as.numeric(sample1[1,])
y<-as.numeric(sample1[2:50,2])
x1<-as.numeric(sample1[2:50,3])
x2<-as.numeric(sample1[2:50,4])
x11<-x1[!is.na(y)]
x12<-x2[!is.na(y)]
y<-y[!is.na(y)]
fit1 <- lm(y ~ x11 + x12, data=sample)
fit1
x3<-as.numeric(sample1[2:50,5])
x4<-as.numeric(sample1[2:50,6])
x13<-x3[!is.na(y)]
x14<-x4[!is.na(y)]
fit2 <- lm(y ~ x11 + x12 + x13 + x14, data=sample)
anova(fit1,fit2)
install.packages("DAAG")
library("DAAG")
cv.lm(df=samplex, fit1, m=10) # 10-fold cross-validation
Any insight will be appreciated.
Example of data
ID peak height LCA001 LCA002 LCA003
N001786 32391.111 0.397 0.229 -0.281
N005356 32341.473 0.397 -0.655 -1.301
N002416 32215.474 -0.703 -0.214 -0.901
GS239 31949.777 0.354 0.118 0.272
N016343 31698.853 0.226 0.04 -0.006
N003255 31604.978 0.024 NA -0.534
N004358 31356.597 -0.252 -0.022 -0.407
N000122 31168.09 -0.487 -0.533 -0.134
GS10564 31106.103 -0.156 -0.141 -1.17
GS17987 31043.876 NA 0.253 0.553
N003674 30876.207 0.109 0.093 0.07
Please see the example of the data above
First, you are using lm(...) incorrectly, or at least in a very unconventional way. The point of specifying the data=sample argument is that names in the formula then refer to columns of sample. Generally, it is very bad practice to reference free-standing vectors in a formula.
So try this:
## not tested...
sample <- read.csv(...)
colnames(sample)[2:6] <- c("y","x1","x2","x3","x4")
fit1 <- lm(y~x1+x2, data=sample[2:50,],na.action=na.omit)
library(DAAG)
cv.lm(df=na.omit(sample[2:50,]),fit1,m=10)
This will give columns 2:6 the appropriate names and then use those in the formula. The argument na.action=na.omit tells the lm(...) function to exclude all rows where there is an NA value in any of the relevant columns. This is actually the default, so it is not needed in this case, but included for clarity.
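You can confirm the default for yourself:
getOption("na.action")
#> [1] "na.omit"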
Finally, cv.lm(...) uses its second argument to find the formula definition, so in your code:
cv.lm(df=samplex, fit1, m=10)
is equivalent to:
cv.lm(df=samplex,y~x11+x12,m=10)
Since there are (presumably) no columns named x11 and x12 in samplex, and since you defined these vectors externally, cv.lm(...) throws the error you are getting.
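You can see the formula that cv.lm(...) will recover by checking the fitted model directly:
formula(fit1)
#> y ~ x11 + x12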
I did coxph for my data and get result like this:
> z
Call:
coxph(formula = Surv(Years, Event) ~ y, data = x)
coef exp(coef) se(coef) z p
y 0.0714 1.07 0.288 0.248 0.8
Likelihood ratio test=0.06 on 1 df, p=0.804 n= 65, number of events= 49
I just want to save
y 0.0714 1.07 0.288 0.248 0.8
into a file, because I run permutations and generate 1000 z objects.
I want to save them into a text file like this:
fin -0.3794 0.684 0.1914 -1.983 0.0470
age -0.0574 0.944 0.0220 -2.611 0.0090
race 0.3139 1.369 0.3080 1.019 0.3100
wexp -0.1498 0.861 0.2122 -0.706 0.4800
mar -0.4337 0.648 0.3819 -1.136 0.2600
paro -0.0849 0.919 0.1958 -0.434 0.6600
Can anyone help? Thanks!
The coefficients are easily accessed by
summary(z)[['coefficients']]
and the confidence interval information by
summary(z)[['conf.int']]
To find out the components of a summary.coxph object, use
str(summary(z))
My advice would be to create a list of your permutations
data_list <- list(data_1, ...., data_1000)
Then call
lots_models <- lapply(data_list, coxph, formula = Surv(Years, Event) ~ y)
Which creates a list of models
You can create the summaries by
lots_summaries <- lapply(lots_models, summary)
Extract the coefficients
all_coefficients <- lapply(lots_summaries, '[[', 'coefficients')
all_conf.int <- lapply(lots_summaries, '[[', 'conf.int')
Add a permutation id column (if you want)
all_coefs_id <- lapply(seq_along(data_list),
                       function(i) cbind(all_coefficients[[i]], i))
all_ci_id <- lapply(seq_along(data_list),
                    function(i) cbind(all_conf.int[[i]], i))
Then combine into a data.frame
all_coefs_df <- do.call(rbind, all_coefs_id)
all_ci_df <- do.call(rbind, all_ci_id)
Which you can then save as a text file.
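For example, with write.table() (a minimal sketch; adjust the file names and separator to taste):
write.table(all_coefs_df, file = "coxph_coefficients.txt", quote = FALSE, sep = "\t")
write.table(all_ci_df, file = "coxph_conf_int.txt", quote = FALSE, sep = "\t")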