As one option for model selection for MCMCglmm (see also this related question), I am trying out model averaging using the package MuMIn. It doesn't seem to work - see the output below. Any ideas why? The output looks like nonsense. In particular, there are a bunch of NA values for the z values, and where these are not NA, they are all exactly 1. This may stem from the fact that all but one model has been assigned a weight of 0, which again seems unrealistic.
Note that in the documentation for MuMIn, it is listed as being compatible with MCMCglmm objects.
Reproducible example:
set.seed(1234)
library(MCMCglmm)
library(MuMIn)
data(bird.families)
n <- Ntip(bird.families)
# Create some dummy variables
d <- data.frame(taxon = bird.families$tip.label,
X1 = rnorm(n),
X2 = rnorm(n),
X3 = sample(c("A", "B", "C"), n, replace = T),
X4 = sample(c("A", "B", "C"), n, replace = T))
# Simulate a phenotype composed of phylogenetic, fixed and residual effects
d$phenotype <- rbv(bird.families, 1, nodes="TIPS") +
d$X1*0.7 +
ifelse(d$X3 == "B", 0.5, 0) +
ifelse(d$X3 == "C", 0.8, 0) +
rnorm(n, 0, 1)
# Inverse matrix of shared phylogenetic history
Ainv <- inverseA(bird.families)$Ainv
# Set priors
prior <- list(R = list(V = 1, nu = 0.002),
G = list(G1 = list(V = 1, nu = 0.002)))
uMCMCglmm <- updateable(MCMCglmm)
model <- uMCMCglmm(phenotype ~ X1 + X2 + X3 + X4,
random = ~taxon,
ginverse = list(taxon=Ainv),
data = d,
prior = prior,
verbose = FALSE)
# Explore possible simplified models
options(na.action = "na.fail")
dred <- dredge(model)
# Calculate a model average
avg <- model.avg(dred)
summary(avg)
Output:
Call:
model.avg(object = dred)
Component model call:
uMCMCglmm(fixed = phenotype ~ <16 unique rhs>, random = ~taxon, data = d,
prior = prior, verbose = FALSE, ginverse = list(taxon = Ainv))
Component models:
df logLik AICc delta weight
3 5 -49.24 108.93 0.00 1
4 5 -71.18 152.82 43.89 0
(Null) 3 -76.98 160.13 51.20 0
34 7 -90.35 195.56 86.63 0
23 6 -95.03 202.71 93.78 0
24 6 -105.79 224.22 115.29 0
1 4 -134.87 278.04 169.11 0
123 7 -137.36 289.59 180.66 0
2 4 -154.82 317.93 209.00 0
234 8 -162.69 342.51 233.58 0
13 6 -167.74 348.12 239.19 0
124 7 -171.06 356.99 248.05 0
14 6 -172.53 357.70 248.77 0
134 8 -171.60 360.33 251.40 0
12 5 -181.16 372.78 263.84 0
1234 9 -189.33 398.07 289.14 0
Term codes:
X1 X2 X3 X4
1 2 3 4
Model-averaged coefficients:
(full average)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.642e-01 NA NA NA
X3B 6.708e-01 6.708e-01 1 0.317
X3C 9.802e-01 9.802e-01 1 0.317
X4B -9.505e-11 9.505e-11 1 0.317
X4C -7.822e-11 7.822e-11 1 0.317
X2 -3.259e-22 3.259e-22 1 0.317
X1 1.378e-37 1.378e-37 1 0.317
(conditional average)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.76421 NA NA NA
X3B 0.67078 NA NA NA
X3C 0.98025 NA NA NA
X4B -0.32229 NA NA NA
X4C -0.26522 NA NA NA
X2 -0.07528 NA NA NA
X1 0.72300 NA NA NA
Relative variable importance:
X3 X4 X2 X1
Importance: 1 <0.01 <0.01 <0.01
N containing models: 8 8 8 8
I have a dataset like this called df
head(df)
  ratio  P  T  H   S p1 p2 PM10 CO2 B  G Month Year
1  0.50 89 -7 98 133  0 40   50  30 3 20     1 2019
2  0.50 55  4 43  43 30 30   40  32 1 15     1 2019
3  0.85 75  4 63  43 30 30   42  32 1 18     1 2019
I would like to do a principal component analysis to reduce the number of variables for a regression analysis. I used this code:
library(factoextra)
df.pca <- prcomp(df, scale = TRUE)
But I got this error message, and because of it I was not able to continue:
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
What am I doing wrong?
prcomp() will assume that every column in the object you are passing to it should be used in the analysis. You'll need to drop any non-numeric columns, as well as any numeric columns that should not be used in the PCA.
library(factoextra)
# Example data
df <- data.frame(
x = letters,
y1 = rbinom(26,1,0.5),
y2 = rnorm(26),
y3 = 1:26,
id = 1:26
)
# Reproduce your error
prcomp(df)
#> Error in colMeans(x, na.rm = TRUE): 'x' must be numeric
# Remove all non-numeric columns
df_nums <- df[sapply(df, is.numeric)]
# Conduct PCA - works but ID column is in there!
prcomp(df_nums, scale = TRUE)
#> Standard deviations (1, .., p=4):
#> [1] 1.445005e+00 1.039765e+00 9.115092e-01 1.333315e-16
#>
#> Rotation (n x k) = (4 x 4):
#> PC1 PC2 PC3 PC4
#> y1 0.27215111 -0.5512026 -0.7887391 0.000000e+00
#> y2 0.07384194 -0.8052981 0.5882536 4.715914e-16
#> y3 -0.67841033 -0.1543868 -0.1261909 -7.071068e-01
#> id -0.67841033 -0.1543868 -0.1261909 7.071068e-01
# Remove ID
df_nums$id <- NULL
# Conduct PCA without ID - success!
prcomp(df_nums, scale = TRUE)
#> Standard deviations (1, .., p=3):
#> [1] 1.1253120 0.9854030 0.8733006
#>
#> Rotation (n x k) = (3 x 3):
#> PC1 PC2 PC3
#> y1 -0.6856024 0.05340108 -0.7260149
#> y2 -0.4219813 -0.84181344 0.3365738
#> y3 0.5931957 -0.53712052 -0.5996836
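As a side note, an equivalent base-R way to drop the non-numeric columns (just an alternative to the sapply() call above) is Filter(), which keeps the columns for which the predicate returns TRUE:
# keep only the numeric columns
df_nums <- Filter(is.numeric, df)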
Is there a way to force model.matrix.lm to drop multicollinear variables, as is done in the estimation stage by lm()?
Here is an example:
library(fixest)
N <- 10
x1 <- rnorm(N)
x2 <- x1
y <- 1 + x1 + x2 + rnorm(N)
df <- data.frame(y = y, x1 = x1, x2 = x2)
fit_lm <- lm(y ~ x1 + x2, data = df)
summary(fit_lm)
# Call:
# lm(formula = y ~ x1 + x2, data = df)
#
# Residuals:
# Min 1Q Median 3Q Max
# -1.82680 -0.41503 0.05499 0.67185 0.97830
#
# Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.7494 0.2885 2.598 0.0317 *
# x1 2.3905 0.3157 7.571 6.48e-05 ***
# x2 NA NA NA NA
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.8924 on 8 degrees of freedom
# Multiple R-squared: 0.8775, Adjusted R-squared: 0.8622
# F-statistic: 57.33 on 1 and 8 DF, p-value: 6.476e-05
Note that lm() drops the collinear variable x2 from the model. But model.matrix() keeps it:
model.matrix(fit_lm)
# (Intercept) x1 x2
#1 1 1.41175158 1.41175158
#2 1 0.06164133 0.06164133
#3 1 0.09285047 0.09285047
#4 1 -0.63202909 -0.63202909
#5 1 0.25189850 0.25189850
#6 1 -0.18553830 -0.18553830
#7 1 0.65630180 0.65630180
#8 1 -1.77536852 -1.77536852
#9 1 -0.30571009 -0.30571009
#10 1 -1.47296229 -1.47296229
#attr(,"assign")
#[1] 0 1 2
The model.matrix method from fixest, by contrast, makes it possible to drop x2:
fit_feols <- feols(y ~ x1 + x2, data = df)
model.matrix(fit_feols, type = "rhs", collin.rm = TRUE)
# (Intercept) x1
# [1,] 1 1.41175158
# [2,] 1 0.06164133
# [3,] 1 0.09285047
# [4,] 1 -0.63202909
# [5,] 1 0.25189850
# [6,] 1 -0.18553830
# [7,] 1 0.65630180
# [8,] 1 -1.77536852
# [9,] 1 -0.30571009
# [10,] 1 -1.47296229
Is there a way to drop x2 when calling model.matrix.lm()?
So long as the overhead from running the linear model is not too high, you could write a small function like this one to do it:
N <- 10
x1 <- rnorm(N)
x2 <- x1
y <- 1 + x1 + x2 + rnorm(N)
df <- data.frame(y = y, x1 = x1, x2 = x2)
fit_lm <- lm(y ~ x1 + x2, data = df)
model.matrix2 <- function(model){
  # names of the coefficients that were actually estimated (non-NA)
  bn <- names(na.omit(coef(model)))
  X <- model.matrix(model)
  # keep only the corresponding columns; drop = FALSE preserves the matrix class
  X[, colnames(X) %in% bn, drop = FALSE]
}
model.matrix2(fit_lm)
#> (Intercept) x1
#> 1 1 -0.04654473
#> 2 1 2.14473751
#> 3 1 0.02688125
#> 4 1 0.95071038
#> 5 1 -1.41621259
#> 6 1 1.47840480
#> 7 1 0.56580182
#> 8 1 0.14480401
#> 9 1 -0.02404072
#> 10 1 -0.14393258
Created on 2022-05-02 by the reprex package (v2.0.1)
In the code above, model.matrix2() is the function that post-processes the model matrix to contain only the variables that have non-missing coefficients in the linear model.
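If you would rather not rely on the fitted coefficients, a minimal sketch of an alternative (assuming it is enough to mimic lm()'s rank detection with a pivoted QR decomposition; drop_collinear() is a made-up helper name):
drop_collinear <- function(X) {
  # qr() pivots linearly dependent columns to the end;
  # keep the first qr_X$rank pivoted columns, restored to their original order
  qr_X <- qr(X)
  X[, sort(qr_X$pivot[seq_len(qr_X$rank)]), drop = FALSE]
}
drop_collinear(model.matrix(fit_lm))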
I am trying to write a function that returns regression coefficients and standard errors, since I need to run a large number of regressions.
The data could look like this:
library(tidyverse)
library(fixest)
library(broom)
data<-tibble(Date = c("2020-01-01","2020-01-01","2020-01-01","2020-01-01","2020-02-01","2020-02-01","2020-02-01","2020-02-01"),
Card = c(1,2,3,4,1,2,3,4),
A = rnorm(8),
B = rnorm(8),
C = rnorm(8)
)
My current code is as follows:
estimation_fun <- function(col1,col2,df) {
regression<-feols(df[[col1]] ~ df[[col2]] | Card + Date, df)
est =tidy(regression)$estimate
se = tidy(regression)$std.error
output <- list(est,se)
return(output)
}
estimation_fun("A","B",example)
However, it does not work. I guess it is related to column name in feols because I can make it work for lm().
The feols() function needs a formula object. You can create one using paste0()/sprintf().
estimation_fun <- function(col1,col2,df) {
regression<-feols(as.formula(sprintf('%s ~ %s | Card + Date', col1, col2)), df)
est =tidy(regression)$estimate
se = tidy(regression)$std.error
output <- list(est,se)
return(output)
}
estimation_fun("A","B",data)
#[[1]]
#[1] -0.1173276
#attr(,"type")
#[1] "Clustered (Card)"
#[[2]]
#[1] 1.083011
#attr(,"type")
#[1] "Clustered (Card)"
To apply this to every pair of variables you can do:
cols <- names(data)[-(1:2)]
do.call(rbind, combn(cols, 2, function(x) {
data.frame(cols = paste0(x, collapse = '-'),
t(estimation_fun(x[1],x[2],data)))
}, simplify = FALSE))
#  cols X1 X2
#1 A-B -0.1173276 1.083011
#2 A-C -0.1117691 0.5648162
#3 B-C -0.3771884 0.1656587
Ronak's right: only formulas made of variable names can be used.
Since fixest 0.10.0, you can use the dot square bracket operator to do just that. See the help page for formula manipulation in xpd.
Just change one line in your code to make it work:
estimation_fun <- function(lhs, rhs, df) {
# lhs must be of length 1 (otherwise => not what you'd want)
# rhs can be a vector of variables
regression <- feols(.[lhs] ~ .[rhs] | Card + Date, df)
# etc...
}
# Example of how ".[]" works:
lhs = "A"
rhs = c("B", "C")
feols(.[lhs] ~ .[rhs], data)
#> OLS estimation, Dep. Var.: A
#> Observations: 8
#> Standard-errors: IID
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.375548 0.428293 0.876849 0.42069
#> B -0.670476 0.394592 -1.699164 0.15004
#> C 0.177647 0.537452 0.330536 0.75440
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.737925 Adj. R2: 0.183702
By the way, I recommend using the built-in multiple estimation facility (see the help here), since estimation speed will be substantially improved.
Update
All combinations can be estimated with one line of code:
# All combinations at once
est_all = feols(c(A, B, C) ~ sw(A, B, C) | Card + Date, data)
Extraction of coefs/SEs can be done with another line:
# Coef + SE // see doc for summary.fixest_multi
coef_se_all = summary(est_all, type = "se_long")
coef_se_all
#> lhs rhs type A B C
#> 1 A A coef 1.0000000 NA NA
#> 2 A A se NaN NA NA
#> 3 A B coef NA 0.8204932 NA
#> 4 A B se NA 1.1102853 NA
#> 5 A C coef NA NA -0.7889534
#> 6 A C se NA NA 0.3260451
#> 7 B A coef 0.2456443 NA NA
#> 8 B A se 0.2314143 NA NA
#> 9 B B coef NA 1.0000000 NA
#> 10 B B se NA NaN NA
#> 11 B C coef NA NA -0.1977089
#> 12 B C se NA NA 0.3335988
#> 13 C A coef -0.4696954 NA NA
#> 14 C A se 0.3858851 NA NA
#> 15 C B coef NA -0.3931512 NA
#> 16 C B se NA 0.8584968 NA
#> 17 C C coef NA NA 1.0000000
#> 18 C C se NA NA NaN
Note: this requires fixest 0.10.1 or higher.
Originally I have the data in this form:
m n
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
I read it in and fit models with the following code:
input<-file('stdin', 'r')
mn <- read.table(input, nrows = 1, as.is = TRUE)
DF <- read.table(input, skip = 0)
m <- mn[[1]]
n <- mn[[2]]
x1<- DF[[1]]
y1<-DF[[2]]
x2<-DF[[3]]
y2<-DF[[4]]
fit1<-lm(x1 ~ poly(y1, 3, raw=TRUE))
fit2<-lm(x2 ~ poly(y2, 3, raw=TRUE))
m = the length of the current data
n = the number of future points to be predicted
x1 = 1 5 9 13
x2 = 2 6 10 14
I would like to predict the values of x1, y1, x2 and y2 for the n points after the given values.
I tried to fit with lm, but I am not sure how to proceed when all the future data points are missing; just getting the coefficients of one variable in terms of another would not be sufficient, as all of them need to be predicted.
In order to get that to run without error, one needs to use skip = 1 on the second read.table():
mn <- read.table(text="m n
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16", nrows = 1, as.is = TRUE)
DF <- read.table(text="m n
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16", skip = 1)
m <- mn[[1]]
n <- mn[[2]]
x1<- DF[[1]]
y1<-DF[[2]]
x2<-DF[[3]]
y2<-DF[[4]]
fit1<-lm(x1 ~ poly(y1, 3, raw=TRUE))
fit2<-lm(x2 ~ poly(y2, 3, raw=TRUE))
So those input data are exactly collinear, and you would NOT expect there to be any useful information in either the quadratic or cubic terms. That is in fact recognized by the lm machinery:
> fit1
Call:
lm(formula = x1 ~ poly(y1, 3, raw = TRUE))
Coefficients:
(Intercept) poly(y1, 3, raw = TRUE)1 poly(y1, 3, raw = TRUE)2
-1 1 0
poly(y1, 3, raw = TRUE)3
0
Generally one should be using the data argument
> fit3<-lm(x1 ~ poly(y1, 3, raw=TRUE), DF)
>
> fit4<-lm(x2 ~ poly(y2, 3, raw=TRUE), DF)
But in this case it doesn't seem to matter:
> predict(fit1, newdata = list(y1=20:23))
1 2 3 4
19 20 21 22
> predict(fit3, newdata = list(y1=20:23))
1 2 3 4
19 20 21 22
> predict(fit2, newdata = list(y2=25:28))
 1  2  3  4
24 25 26 27
The way to get predictions is to supply a newdata argument that can be coerced into a dataframe. Using a list value that has items of the same length (in this case a single argument) will succeed.
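To address the original question of predicting n points beyond the observed data, here is a minimal sketch (assuming that "future" simply means continuing the y1/y2 sequences in steps of 4, as in the example data; new_y1 and new_y2 are made-up names):
n <- 2  # number of future points, for illustration
new_y1 <- max(y1) + 4 * seq_len(n)
new_y2 <- max(y2) + 4 * seq_len(n)
# predicted x1 and x2 at those future points
predict(fit1, newdata = data.frame(y1 = new_y1))
predict(fit2, newdata = data.frame(y2 = new_y2))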
What I am currently struggling with is understanding how to simulate 10,000 draws while keeping the covariates fixed.
      Y      X1    X2    X3
  <int>   <dbl> <dbl> <int>
1  4264 305.657  7.17     0
2  4496 328.476  6.20     0
3  4317 317.164  4.61     0
4  4292 366.745  7.02     0
5  4945 265.518  8.61     1
6  4325 301.995  6.88     0
That is the head of the grocery data frame.
What I've done so far for related problems:
#5.
#using beta_hat
#create the design matrix: a column of 1s for the intercept plus the Xs (52 rows)
X <- cbind(rep(1,52), grocery$X1, grocery$X2, grocery$X3)
beta_hat <- solve((t(X) %*% X)) %*% t(X) %*% grocery$Y
round(t(beta_hat), 2)
#using lm formula and residuals
#lm formula
lm0 <- lm(formula = Y ~ X1 + X2 + X3, data = grocery)
#6.
residuals(lm0)[1:5]
Below is the output of lm() for the original model:
Call:
lm(formula = Y ~ X1 + X2 + X3, data = grocery)
Residuals:
Min 1Q Median 3Q Max
-264.05 -110.73 -22.52 79.29 295.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4149.8872 195.5654 21.220 < 2e-16 ***
X1 0.7871 0.3646 2.159 0.0359 *
X2 -13.1660 23.0917 -0.570 0.5712
X3 623.5545 62.6409 9.954 2.94e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 143.3 on 48 degrees of freedom
Multiple R-squared: 0.6883, Adjusted R-squared: 0.6689
F-statistic: 35.34 on 3 and 48 DF, p-value: 3.316e-12
The result should be a loop that generates the sampling distribution of the t statistic. Right now what I have is for another problem that focuses on fitting the model to the data.
Here I'm given the true model (for the hypothesis being tested), but I am not sure where to begin with the loop.
Okay, have a look at the following:
# get some sample data:
set.seed(42)
df <- data.frame(X1 = rnorm(10), X2 = rnorm(10), X3 = rbinom(10, 1, 0.5))
# note how X1 gets multiplied with 0, to highlight that the null is imposed.
df$y_star <- with(df, 4200 + 0*X1 - 15*X2 + 620 * X3)
head(df)
X1 X2 X3 y_star
1 1.37095845 1.3048697 0 4180.427
2 -0.56469817 2.2866454 0 4165.700
3 0.36312841 -1.3888607 0 4220.833
4 0.63286260 -0.2787888 1 4824.182
5 0.40426832 -0.1333213 0 4202.000
# define function to get the t statistic
get_tstat <- function(){
# declare the outcome, with random noise added:
# The added random noise here will be different in each draw
df$y <- with(df, y_star + rnorm(10, mean = 0, sd = sqrt(20500)))
# run linear model
mod <- lm(y ~ X1 + X2 + X3, data = df)
return(summary(mod)$coefficients["X1", "t value"])
}
# get 10 values from the t-statistic:
replicate(10, get_tstat())
[1] -0.8337737 -1.2567709 -1.2303073 0.3629552 -0.1203216 -0.1150734 0.3533095 1.6261360
[9] 0.8259006 -1.3979176
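To get the full sampling distribution asked for in the question, just increase the number of replications (a sketch with the 10,000 draws stated in the question; tstats is a made-up name):
tstats <- replicate(10000, get_tstat())
# visualise the simulated sampling distribution of the t statistic for X1
hist(tstats, breaks = 50)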