R: how to call feols regression within a function

I am trying to write a function that returns a regression coefficient and its standard error, since I need to run a large number of regressions.
The data could look like this:
library(tidyverse)
library(fixest)
library(broom)
data <- tibble(Date = c("2020-01-01","2020-01-01","2020-01-01","2020-01-01",
                        "2020-02-01","2020-02-01","2020-02-01","2020-02-01"),
               Card = c(1,2,3,4,1,2,3,4),
               A = rnorm(8),
               B = rnorm(8),
               C = rnorm(8))
My current code is as follows:
estimation_fun <- function(col1, col2, df) {
  regression <- feols(df[[col1]] ~ df[[col2]] | Card + Date, df)
  est <- tidy(regression)$estimate
  se <- tidy(regression)$std.error
  output <- list(est, se)
  return(output)
}
estimation_fun("A", "B", data)
However, it does not work. I guess it is related to how the column names are passed to feols, because I can make the same approach work with lm().

The feols function needs a formula object. You can create one using paste0/sprintf together with as.formula:
estimation_fun <- function(col1, col2, df) {
  regression <- feols(as.formula(sprintf('%s ~ %s | Card + Date', col1, col2)), df)
  est <- tidy(regression)$estimate
  se <- tidy(regression)$std.error
  output <- list(est, se)
  return(output)
}
estimation_fun("A", "B", data)
#[[1]]
#[1] -0.1173276
#attr(,"type")
#[1] "Clustered (Card)"
#[[2]]
#[1] 1.083011
#attr(,"type")
#[1] "Clustered (Card)"
To apply this to every pair of variables, you may do:
cols <- names(data)[-(1:2)]
do.call(rbind, combn(cols, 2, function(x) {
  data.frame(cols = paste0(x, collapse = '-'),
             t(estimation_fun(x[1], x[2], data)))
}, simplify = FALSE))
#  cols         X1        X2
#1  A-B -0.1173276  1.083011
#2  A-C -0.1117691 0.5648162
#3  B-C -0.3771884 0.1656587
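Note that t() of a list produces list columns in the resulting data.frame; if you prefer plain numeric columns, a small variant works too (a sketch, reusing the same estimation_fun and cols from above):
do.call(rbind, combn(cols, 2, function(x) {
  res <- estimation_fun(x[1], x[2], data)
  data.frame(cols = paste(x, collapse = "-"),
             est = res[[1]],   # coefficient
             se  = res[[2]])   # standard error
}, simplify = FALSE))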

Ronak's right: only formulas made of variable names can be used.
Since fixest 0.10.0, you can use the dot square bracket operator to do just that. See the help page for formula manipulation in xpd.
Just change one line in your code to make it work:
estimation_fun <- function(lhs, rhs, df) {
  # lhs must be of length 1 (otherwise => not what you'd want)
  # rhs can be a vector of variables
  regression <- feols(.[lhs] ~ .[rhs] | Card + Date, df)
  # etc...
}
# Example of how ".[]" works:
lhs = "A"
rhs = c("B", "C")
feols(.[lhs] ~ .[rhs], data)
#> OLS estimation, Dep. Var.: A
#> Observations: 8
#> Standard-errors: IID
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.375548 0.428293 0.876849 0.42069
#> B -0.670476 0.394592 -1.699164 0.15004
#> C 0.177647 0.537452 0.330536 0.75440
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.737925 Adj. R2: 0.183702
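For reference, the xpd() function mentioned above performs the same .[] expansion without running any estimation, which is handy for checking the formula that will actually be fit. A minimal sketch, using the lhs/rhs values defined above:
# xpd() only expands the formula; feols() calls it internally
xpd(.[lhs] ~ .[rhs] | Card + Date)
#> A ~ B + C | Card + Date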
By the way, I recommend using the built-in multiple estimation facility (see the help here), since estimation speed will be substantially improved.
Update
All combinations can be estimated with one line of code:
# All combinations at once
est_all = feols(c(A, B, C) ~ sw(A, B, C) | Card + Date, data)
Extraction of coefs/SEs can be done with another line:
# Coef + SE // see doc for summary.fixest_multi
coef_se_all = summary(est_all, type = "se_long")
coef_se_all
#> lhs rhs type A B C
#> 1 A A coef 1.0000000 NA NA
#> 2 A A se NaN NA NA
#> 3 A B coef NA 0.8204932 NA
#> 4 A B se NA 1.1102853 NA
#> 5 A C coef NA NA -0.7889534
#> 6 A C se NA NA 0.3260451
#> 7 B A coef 0.2456443 NA NA
#> 8 B A se 0.2314143 NA NA
#> 9 B B coef NA 1.0000000 NA
#> 10 B B se NA NaN NA
#> 11 B C coef NA NA -0.1977089
#> 12 B C se NA NA 0.3335988
#> 13 C A coef -0.4696954 NA NA
#> 14 C A se 0.3858851 NA NA
#> 15 C B coef NA -0.3931512 NA
#> 16 C B se NA 0.8584968 NA
#> 17 C C coef NA NA 1.0000000
#> 18 C C se NA NA NaN
Note: this requires fixest 0.10.1 or higher.
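If you only need the off-diagonal coefficient/SE rows, the long summary can be filtered like an ordinary data frame. A sketch, assuming the returned object behaves the way the printout above suggests:
# Keep coefficient rows for distinct variable pairs only
subset(coef_se_all, type == "coef" & lhs != rhs)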

PCA x must be numeric in R

I have a dataset like this called df
head(df[, 1:13])
#   ratio  P  T  H   S p1 p2 PM10 CO2 B  G Month Year
# 1  0.50 89 -7 98 133  0 40   50  30 3 20     1 2019
# 2  0.50 55  4 43  43 30 30   40  32 1 15     1 2019
# 3  0.85 75  4 63  43 30 30   42  32 1 18     1 2019
I would like to do a principal component analysis to reduce the number of variables for a regression analysis. I used this code:
library(factoextra)
df.pca <- prcomp(df, scale = TRUE)
But I got this error message, and for that reason I was not able to continue:
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
What am I doing wrong?
prcomp() will assume that every column in the object you are passing to it should be used in the analysis. You'll need to drop any non-numeric columns, as well as any numeric columns that should not be used in the PCA.
library(factoextra)
# Example data
df <- data.frame(
  x = letters,
  y1 = rbinom(26, 1, 0.5),
  y2 = rnorm(26),
  y3 = 1:26,
  id = 1:26
)
# Reproduce your error
prcomp(df)
#> Error in colMeans(x, na.rm = TRUE): 'x' must be numeric
# Remove all non-numeric columns
df_nums <- df[sapply(df, is.numeric)]
# Conduct PCA - works but ID column is in there!
prcomp(df_nums, scale = TRUE)
#> Standard deviations (1, .., p=4):
#> [1] 1.445005e+00 1.039765e+00 9.115092e-01 1.333315e-16
#>
#> Rotation (n x k) = (4 x 4):
#> PC1 PC2 PC3 PC4
#> y1 0.27215111 -0.5512026 -0.7887391 0.000000e+00
#> y2 0.07384194 -0.8052981 0.5882536 4.715914e-16
#> y3 -0.67841033 -0.1543868 -0.1261909 -7.071068e-01
#> id -0.67841033 -0.1543868 -0.1261909 7.071068e-01
# Remove ID
df_nums$id <- NULL
# Conduct PCA without ID - success!
prcomp(df_nums, scale = TRUE)
#> Standard deviations (1, .., p=3):
#> [1] 1.1253120 0.9854030 0.8733006
#>
#> Rotation (n x k) = (3 x 3):
#> PC1 PC2 PC3
#> y1 -0.6856024 0.05340108 -0.7260149
#> y2 -0.4219813 -0.84181344 0.3365738
#> y3 0.5931957 -0.53712052 -0.5996836
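If you prefer the tidyverse, the same column filtering can be done with dplyr's select() helpers. A sketch, assuming dplyr >= 1.0.0 for where():
library(dplyr)
# Keep the numeric columns and drop the id column in one step
df_nums <- df %>% select(where(is.numeric), -id)
prcomp(df_nums, scale = TRUE)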

How do I use two dataframe conditions with for loop to get nrow(output)

I am very green with R, so please bear with my wording. I have a df from a csv that has 106 obs. of 11 variables. I only care about 2 of those variables, so I made a new df called "df".
bc <- read.csv("---.csv")
df <- cbind.data.frame('A' = bc$A, 'B' = bc$B)
# Example of the new df:
#        A   B
#   mass 0.1
#   mass 0.2
# height 0.5
# height 0.3
#  color 0.9
#  color 0.1
Then I made (4) vectors, each based on how many rows could satisfy (2) simultaneous conditions: B greater than OET or less than OET, AND type is "mass" or type is not "mass".
TP <- df[df$B >= i & df$A == "mass", ]
TN <- df[df$B <= i & df$A != "mass", ]
FP <- df[df$B <= i & df$A != "mass", ]
FN <- df[df$B <= i & df$A == "mass", ]
I think I want to use a for loop so I could have a vector for every B condition, i.e. for every i. If I set i to a single value, the vectors give me all rows that fit, and then nrow() tells me how many rows that is; but I cannot type all 106 df$B values into i by hand. I used print() to check that looping over i would work, and it showed that I could get every row from df$B. Then I tried half of the TP condition, the df$A part alone; that worked. Next I tried the df$B part alone, but this gave me all 106 obs., which I know is wrong because the non-looped TP gave me 21 obs. The end goal of the code is to give me a number of TP and TN for every df$B value that meets my (2) conditions, so that I can plug them into another function for ggplot [like Y = TP/(TP - TN)].
N <- c(df$B)
for (i in N) {
  print(paste(i))
}
# worked

for (i in N) {
  TPA <- df[df$A == "mass", ]
  TP <- nrow(TPA)
}
# worked

for (i in N) {
  TPB <- df[df$B >= i, ]
  TP <- nrow(TPB)
}
# ran but did not do what I wanted
I guess my question is how do I run all rows of df$B against each df$B, all 106 of them, and store them?
When i = df$B[1], how many rows of df$B are >i
When i= df$B[2], how many rows of df$B are >i
From a formula like this, I would like an output like below:
results <- data.frame(matrix(nrow = , ncol = 4))
colnames(results) <- c("A", "B", "TP", "TN")
B <- rep(c("mass", "not mass"), each = 106)
N <- c(df$B)
for (i in N) {
  TPC <- df[df$A == 'mass' & df$B >= i, ]
  TP <- nrow(TPC)
  TNC <- df[df$A != 'mass' & df$B <= i, ]
  TN <- nrow(TNC)
}
results <- cbind.data.frame(B, A, results)
#       B        A TP TN
#    mass  df$B[1] 21  0
#    mass  df$B[2] 18 12
#    ...
# notmass  df$B[1]  1 11
# notmass  df$B[2]  3 10
#    ...
If you read this far, thank you! Any direction or answer would be most appreciated!
I'm not sure I'm understanding the terms of your confusion matrix properly, but here's a suggestion for a general approach that seems to me more idiomatic to R, using in this case dplyr and tidyr.
Starting with your data:
df1 <- data.frame(
  stringsAsFactors = FALSE,
  A = c("mass", "mass", "height", "height", "color", "color"),
  B = c(0.1, 0.2, 0.5, 0.3, 0.9, 0.1)
)
We can add a logical mass variable to capture if A is or isn't equal to "mass". We can also make a list of the values of B to use later.
df1$mass <- df1$A == "mass"
B_val <- sort(unique(df1$B))
Below, I make a copy of the data for each value of B_val and use dplyr::case_when to define the values of the confusion matrix. (I suspect I don't have these right, but should be simple to fix.)
Finally, at the bottom I count how many combinations arise, and then reshape the data into wider format with columns named for each conclusion.
library(dplyr); library(tidyr)
df1 %>%
  crossing(B_val) %>%
  mutate(type = case_when(
    B >= B_val & mass  ~ "TP",
    B <= B_val & !mass ~ "TN",
    B <= B_val & mass  ~ "FP",
    B >= B_val & !mass ~ "FN",
    TRUE ~ "undefined"
  )) %>%
  count(mass, B_val, type) %>%
  # group_by(mass, B_val) %>%   # un-comment these lines for proportions
  # mutate(n = n / sum(n)) %>%
  pivot_wider(names_from = type, values_from = n)
This produces the output below:
# A tibble: 10 x 6
mass B_val FN TN TP FP
<lgl> <dbl> <int> <int> <int> <int>
1 FALSE 0.1 3 1 NA NA
2 FALSE 0.2 3 1 NA NA
3 FALSE 0.3 2 2 NA NA
4 FALSE 0.5 1 3 NA NA
5 FALSE 0.9 NA 4 NA NA
6 TRUE 0.1 NA NA 2 NA
7 TRUE 0.2 NA NA 1 1
8 TRUE 0.3 NA NA NA 2
9 TRUE 0.5 NA NA NA 2
10 TRUE 0.9 NA NA NA 2
Or if looking at proportions:
mass B_val FN TN TP FP
<lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 FALSE 0.1 0.75 0.25 NA NA
2 FALSE 0.2 0.75 0.25 NA NA
3 FALSE 0.3 0.5 0.5 NA NA
4 FALSE 0.5 0.25 0.75 NA NA
5 FALSE 0.9 NA 1 NA NA
6 TRUE 0.1 NA NA 1 NA
7 TRUE 0.2 NA NA 0.5 0.5
8 TRUE 0.3 NA NA NA 1
9 TRUE 0.5 NA NA NA 1
10 TRUE 0.9 NA NA NA 1
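If the NA cells are inconvenient for downstream arithmetic (e.g. a ratio like the Y = TP/(TP - TN) mentioned in the question), pivot_wider() can fill the absent combinations with zero via its values_fill argument. A sketch of the same pipeline, assuming tidyr >= 1.1.0:
library(dplyr); library(tidyr)
df1 %>%
  crossing(B_val) %>%
  mutate(type = case_when(
    B >= B_val & mass  ~ "TP",
    B <= B_val & !mass ~ "TN",
    B <= B_val & mass  ~ "FP",
    B >= B_val & !mass ~ "FN",
    TRUE ~ "undefined"
  )) %>%
  count(mass, B_val, type) %>%
  pivot_wider(names_from = type, values_from = n, values_fill = 0)  # 0 instead of NA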

Using reformulate to create a formula with two tildes

I am trying to use reformulate to create, for feeding into an R function, a pair of one-sided formulas that looks like this:
~X, ~Y
To be fed into this type of function:
as.data.frame(svyby(~X,~Y, design, svymean, na.rm= T))
I know that:
reformulate("X","Y")
returns: Y ~ X
How do I modify reformulate to achieve the above? I have tried:
reformulate(c(`~`, "X"), c(`~`, "Y"))  # throws an error
form <- noquote(paste0("~", x, ",", "~", column))  # is not accepted in the function
This is the full function I am feeding it into, where form contains the 'formulae' portion, x is pulled in from a vector vector_vars of c("Q1", "Q2") etc., and column is a character variable ('poverty') being fed in from a larger function.
myfun <- function(x){
  form <- noquote(paste0("~", x, ",", "~", column))
  cbind(as.data.frame(svyby(form, design, svymean, na.rm = TRUE)),
        freq = c(svytable(form, design)))
}
do.call(rbind, lapply(vector_vars, myfun))
Data and function for testing
data <- read_table2("Q50_1 Q50_2 Q38 Q90 pov gender wgt id
yes 3 Yes NA High M 1.3 A
NA 4 No 2 Med F 0.4 B
no 2 NA 4 Low F 1.2 C
maybe 3 No 2 High M 0.5 D
yes NA No NA High M 0.7 E
no 2 Yes 3 Low F 0.56 F
maybe 4 Yes 2 Med F 0.9 G
")
design <- svydesign(id = ~id,
                    weights = ~wgt,
                    nest = FALSE,
                    data = data)
vector_vars <- c("Q50_1", "Q38")
create_df <- function(design, vector_vars, column){
  # function to retrieve the weighted mean and se
  myfun <- function(x){
    form <- noquote(paste0("~", x, ",", "~", column))
    cbind(as.data.frame(svyby(form, design, svymean, na.rm = TRUE)),
          freq = c(svytable(form, design)))
  }
  final <- do.call(rbind, lapply(vector_vars, myfun))
  return(final)
}
create_df(design, vector_vars, 'gender')
In svyby, the ~X and ~Y are used for different arguments, i.e. the first is for formula and the second for by.
library(survey)
myfun <- function(design, x, colnm) {
  # first formula (for the `formula` argument)
  fmla <- reformulate(x)
  # second formula (for the `by` argument)
  by <- reformulate(colnm)
  # return a named list
  list(d1 = as.data.frame(svyby(fmla, by, design, svymean, na.rm = TRUE)),
       freq = c(svytable(fmla, design)))
}
Testing:
lapply(vector_vars, function(x) myfun(design, x, "gender"))
[[1]]
[[1]]$d1
  gender Q50_1maybe   Q50_1no se.Q50_1maybe se.Q50_1no
F      F  0.3383459 0.6616541     0.3026058  0.3026058
M      M  0.2000000 0.8000000     0.2148115  0.2148115

[[1]]$freq
maybe    no   yes
 1.40  1.76  2.00

[[2]]
[[2]]$d1
  gender     Q38No    Q38Yes  se.Q38No se.Q38Yes
F      F 0.2150538 0.7849462 0.2253182 0.2253182
M      M 0.4800000 0.5200000 0.3317149 0.3317149

[[2]]$freq
  No  Yes
1.60 2.76
As the results show, the column names differ across the list elements, so rbind wouldn't work. We may use bind_rows from dplyr or rbindlist from data.table with the updated function:
myfun <- function(design, x, colnm) {
  fmla <- reformulate(x)
  by <- reformulate(colnm)
  d1 <- as.data.frame(svyby(fmla, by, design, svymean, na.rm = TRUE))
  freq <- c(svytable(fmla, design))
  d1[names(freq)] <- as.list(freq)
  return(d1)
}
library(data.table)
rbindlist(lapply(vector_vars, function(x) myfun(design, x, "gender")), fill = TRUE)
gender Q50_1maybe Q50_1no se.Q50_1maybe se.Q50_1no maybe no yes Q38No Q38Yes se.Q38No se.Q38Yes No Yes
1: F 0.3383459 0.6616541 0.3026058 0.3026058 1.4 1.76 2 NA NA NA NA NA NA
2: M 0.2000000 0.8000000 0.2148115 0.2148115 1.4 1.76 2 NA NA NA NA NA NA
3: F NA NA NA NA NA NA NA 0.2150538 0.7849462 0.2253182 0.2253182 1.6 2.76
4: M NA NA NA NA NA NA NA 0.4800000 0.5200000 0.3317149 0.3317149 1.6 2.76
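The dplyr route mentioned above works the same way, since bind_rows() also pads missing columns with NA. A sketch:
library(dplyr)
bind_rows(lapply(vector_vars, function(x) myfun(design, x, "gender")))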

Why does MuMIn give weird results with MCMCglmm?

As one option for model selection for MCMCglmm (see also this related question), I am trying out model averaging using the package MuMIn. It doesn't seem to work; see the output below. Any ideas why? The output looks like nonsense. In particular, there are a bunch of NA values for z values, and where these are not NA, they are all exactly 1. This may stem from the fact that all but one model has been assigned a weight of 0, which again seems unrealistic.
Note that in the documentation for MuMIn, it is listed as being compatible with MCMCglmm objects.
Reproducible example:
set.seed(1234)
library(MCMCglmm)
data(bird.families)
n <- Ntip(bird.families)
# Create some dummy variables
d <- data.frame(taxon = bird.families$tip.label,
                X1 = rnorm(n),
                X2 = rnorm(n),
                X3 = sample(c("A", "B", "C"), n, replace = TRUE),
                X4 = sample(c("A", "B", "C"), n, replace = TRUE))
# Simulate a phenotype composed of phylogenetic, fixed and residual effects
d$phenotype <- rbv(bird.families, 1, nodes = "TIPS") +
  d$X1 * 0.7 +
  ifelse(d$X3 == "B", 0.5, 0) +
  ifelse(d$X3 == "C", 0.8, 0) +
  rnorm(n, 0, 1)
# Inverse matrix of shared phylogenetic history
Ainv <- inverseA(bird.families)$Ainv
# Set priors
prior <- list(R = list(V = 1, nu = 0.002),
              G = list(G1 = list(V = 1, nu = 0.002)))
uMCMCglmm <- updateable(MCMCglmm)
model <- uMCMCglmm(phenotype ~ X1 + X2 + X3 + X4,
                   random = ~taxon,
                   ginverse = list(taxon = Ainv),
                   data = d,
                   prior = prior,
                   verbose = FALSE)
# Explore possible simplified models
options(na.action = "na.fail")
dred <- dredge(model)
# Calculate a model average
avg <- model.avg(dred)
summary(avg)
Output:
Call:
model.avg(object = dred)
Component model call:
uMCMCglmm(fixed = phenotype ~ <16 unique rhs>, random = ~taxon, data = d,
prior = prior, verbose = FALSE, ginverse = list(taxon = Ainv))
Component models:
df logLik AICc delta weight
3 5 -49.24 108.93 0.00 1
4 5 -71.18 152.82 43.89 0
(Null) 3 -76.98 160.13 51.20 0
34 7 -90.35 195.56 86.63 0
23 6 -95.03 202.71 93.78 0
24 6 -105.79 224.22 115.29 0
1 4 -134.87 278.04 169.11 0
123 7 -137.36 289.59 180.66 0
2 4 -154.82 317.93 209.00 0
234 8 -162.69 342.51 233.58 0
13 6 -167.74 348.12 239.19 0
124 7 -171.06 356.99 248.05 0
14 6 -172.53 357.70 248.77 0
134 8 -171.60 360.33 251.40 0
12 5 -181.16 372.78 263.84 0
1234 9 -189.33 398.07 289.14 0
Term codes:
X1 X2 X3 X4
1 2 3 4
Model-averaged coefficients:
(full average)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.642e-01 NA NA NA
X3B 6.708e-01 6.708e-01 1 0.317
X3C 9.802e-01 9.802e-01 1 0.317
X4B -9.505e-11 9.505e-11 1 0.317
X4C -7.822e-11 7.822e-11 1 0.317
X2 -3.259e-22 3.259e-22 1 0.317
X1 1.378e-37 1.378e-37 1 0.317
(conditional average)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.76421 NA NA NA
X3B 0.67078 NA NA NA
X3C 0.98025 NA NA NA
X4B -0.32229 NA NA NA
X4C -0.26522 NA NA NA
X2 -0.07528 NA NA NA
X1 0.72300 NA NA NA
Relative variable importance:
X3 X4 X2 X1
Importance: 1 <0.01 <0.01 <0.01
N containing models: 8 8 8 8

Cramer's V with missing values gives different results

My question concerns the calculation of Cramer's V to detect correlation between categorical variables. I've got a dataset with missing values, but I created a fake dataset for illustration, with two variables a and b that contain missing entries (empty strings).
a <- factor(c("M","","F","F","","M","F","F"))
a2 <- factor(a, levels = c('M','F'),labels =c('Male','Female'))
b <- factor(c("y","y","","y","n","n","n","y"))
b2 <- factor(b, levels=c("y","n"),labels=c("yes","no"))
df<-cbind(a2,b2)
The assocstats function gives me the result for Cramer's V:
require(vcd)
tab <- table(a, b)
assocstats(tab)
#                      X^2 df P(> X^2)
# Likelihood Ratio  1.7261  4  0.78597
# Pearson           1.3333  4  0.85570
#
# Phi-Coefficient    : 0.408
# Contingency Coeff. : 0.378
# Cramer's V         : 0.289
Now I want to drop the NA's from the levels:
a[a == ""] <- NA
a3 <- droplevels(a)
levels(a3)
tab <- table(a, b)
assocstats(tab)
But every time I remove the NA's, the result looks like this:
#                      X^2 df P(> X^2)
# Likelihood Ratio 0.13844  2  0.93312
# Pearson              NaN  2      NaN
#
# Phi-Coefficient    : NaN
# Contingency Coeff. : NaN
# Cramer's V         : NaN
Also, because I have a large dataset, I would like to calculate a matrix of Cramer's V results. I found this code here on Stack Overflow and it seems to work...
get.V <- function(y){
  col.y <- ncol(y)
  V <- matrix(ncol = col.y, nrow = col.y)
  for (i in 1:col.y) {
    for (j in 1:col.y) {
      V[i, j] <- assocstats(table(y[, i], y[, j]))$cramer
    }
  }
  return(V)
}
get.V(tab)
Except that the result is different from the one given by the assocstats function:
     [,1] [,2] [,3]
[1,]  1.0  0.5    1
[2,]  0.5  1.0    1
[3,]  1.0  1.0    1
This cannot be right, because I get this result every time, even when changing the number of observations... What is wrong with this code?
Conclusion: I don't know which of the results is right. I have a large dataset with a lot of NA's in it. The first assocstats result and the code give different results, although there is no big difference, because the code only creates a matrix. The second assocstats call gives only NaN. I can't detect any errors... Can somebody help me?
You don't have to replace the "" with NA if you are using factors: any value that you don't define in levels will be converted to NA by factor.
a <- factor(c("M","","F","F","","M","F","F"))
a2 <- factor(a, levels = c('M','F'),labels =c('Male','Female'))
a
# [1] M F F M F F
# Levels: F M
a2
# [1] Male <NA> Female Female <NA> Male Female Female
# Levels: Male Female
b <- factor(c("y","y","","y","n","n","n","y"))
b2 <- factor(b, levels=c("y","n"),labels=c("yes","no"))
(df <- cbind(a2,b2))
# a2 b2
# [1,] 1 1
# [2,] NA 1
# [3,] 2 NA
# [4,] 2 1
# [5,] NA 2
# [6,] 1 2
# [7,] 2 2
# [8,] 2 1
Above, you're creating a matrix which loses all the labels that you created with factor. I think you want a data frame:
(df <- data.frame(a2,b2))
# a2 b2
# 1 Male yes
# 2 <NA> yes
# 3 Female <NA>
# 4 Female yes
# 5 <NA> no
# 6 Male no
# 7 Female no
# 8 Female yes
require('vcd')
(tab <- table(a2,b2, useNA = 'ifany'))
# b2
# a2 yes no <NA>
# Male 1 1 0
# Female 2 1 1
# <NA> 1 1 0
(tab <- table(a2,b2))
# b2
# a2 yes no
# Male 1 1
# Female 2 1
You need to explicitly tell table if you want to see NA values in the table. Otherwise, it will drop them by default so that you are already "excluding" them when you use assocstats:
assocstats(tab)
# X^2 df P(> X^2)
# Likelihood Ratio 0.13844 1 0.70983
# Pearson 0.13889 1 0.70939
#
# Phi-Coefficient : 0.167
# Contingency Coeff.: 0.164
# Cramer's V : 0.167
For get.V just pass the data frame or matrix, not the table:
get.V <- function(y) {
  col.y <- ncol(y)
  V <- matrix(ncol = col.y, nrow = col.y)
  for (i in 1:col.y) {
    for (j in 1:col.y) {
      V[i, j] <- assocstats(table(y[, i], y[, j]))$cramer
    }
  }
  return(V)
}
get.V(df)
# [,1] [,2]
# [1,] 1.0000000 0.1666667
# [2,] 0.1666667 1.0000000
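As for the NaN in the question: after a[a == ""] <- NA, the empty string is still a level of a, so table(a, b) contains an all-zero row, and the chi-square's expected counts of zero produce NaN. Tabulating the droplevels() version instead should avoid this; a sketch:
a[a == ""] <- NA
table(a, b)               # still has an all-zero "" row
a3 <- droplevels(a)       # drop the now-unused "" level
assocstats(table(a3, b))  # finite statistics again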
