Loop through variables to make regressions - r

I am wondering if I can run multiple regressions over this data frame:
Country Years FDI_InFlow_MilUSD FDI_InFlow_percGDP FDI_InStock_MilUSD FDI_OutFlow_MilUSD FDI_OutFlow_percGDP
1 Netherlands 1990 11063.31 3.52 71827.79 14371.94 34.96
2 Romania 1990 0.01 0.00 0.01 18.00 0.16
3 Netherlands 1991 6074.61 1.88 75404.38 13484.54 37.09
4 Romania 1991 40.00 0.13 44.00 3.00 0.29
5 Netherlands 1992 6392.10 1.78 73918.54 13153.78 33.15
6 Romania 1992 77.00 0.37 122.00 4.00 0.38
I would like to run the regression over all variables of interest, columns 3:7 in this case (my original data has 10 variables, but I think this is enough to show what I want). I would also like to have the lm results stored in a data frame and grouped by Country (if that's possible), rather than making two data frames, one per Country, and then looping through them.
Here's an example of the kind of data frame I want (this one isn't grouped):
# term estimate std.error statistic p.value
# 1 (Intercept) -3.2002150 0.256885790 -12.457735 8.141394e-25
# 2 Sepal.Length 0.7529176 0.043530170 17.296454 2.325498e-37
# 3 (Intercept) 3.1568723 0.413081984 7.642242 2.474053e-12
# 4 Sepal.Width -0.6402766 0.133768277 -4.786461 4.073229e-06
# 5 (Intercept) -0.3630755 0.039761990 -9.131221 4.699798e-16
# 6 Petal.Length 0.4157554 0.009582436 43.387237 4.675004e-86
Here's an example of the desired layout; in this case the calculations are done on both countries together and are just repeated for each Country:
Country term estimate std.error statistic p.value
1 Netherlands (Intercept) -67825.16741 2.229068e+04 -3.042759 3.615586e-03
2 Netherlands GDP_pcap_USD 14.04734 7.908839e-01 17.761576 3.285528e-24
3 Romania (Intercept) -67825.16741 2.229068e+04 -3.042759 3.615586e-03
4 Romania GDP_pcap_USD 14.04734 7.908839e-01 17.761576 3.285528e-24
I used this line of code: FDI2 %>% group_by(Country) %>% do(tidy(lm(FDI_InStock_MilUSD ~ GDP_pcap_USD, data= FDI2)))

If I understand correctly, the following will do what you want. All that is needed is to note that lm can fit several responses at once: when the left-hand side is a matrix, it fits one regression per column and returns an object of class "mlm".
models <- lm(as.matrix(df1[-(1:2)]) ~ Country + Years, df1)
class(models)
#[1] "mlm" "lm"
smry <- summary(models)
result <- lapply(smry, coef)
result <- do.call(rbind, result)
head(result)
#                    Estimate   Std. Error    t value   Pr(>|t|)
#(Intercept) 2.294616e+06 1.847179e+06 1.2422273 0.30241037
#CountryRomania -7.804337e+03 1.515033e+03 -5.1512655 0.01418200
#Years -1.148555e+03 9.277644e+02 -1.2379813 0.30377452
#(Intercept) 6.843108e+02 7.063395e+02 0.9688129 0.40410011
#CountryRomania -2.226667e+00 5.793307e-01 -3.8435157 0.03107572
#Years -3.425000e-01 3.547662e-01 -0.9654247 0.40554755
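If you want separate fits per Country rather than a Country term in one model, a minimal sketch is to reshape to long format and fit within groups. The assumptions here: the data frame is called df1 as above, only two of the response columns from the question are used, Years is the predictor (as in the model above, since GDP_pcap_USD is not in the sample data), and the column names response and value are made up for the example. Note that data = . refers to the current group; passing the full data frame inside do() refits the same model for every group.

```r
library(dplyr)
library(tidyr)
library(broom)

# Sample rows copied from the question (two response columns only)
df1 <- data.frame(
  Country = rep(c("Netherlands", "Romania"), times = 3),
  Years = rep(1990:1992, each = 2),
  FDI_InFlow_MilUSD = c(11063.31, 0.01, 6074.61, 40, 6392.10, 77),
  FDI_InStock_MilUSD = c(71827.79, 0.01, 75404.38, 44, 73918.54, 122)
)

result <- df1 %>%
  # one row per Country / Years / response variable
  pivot_longer(-c(Country, Years), names_to = "response", values_to = "value") %>%
  group_by(Country, response) %>%
  # `.` is the data for the current Country/response group only
  do(tidy(lm(value ~ Years, data = .))) %>%
  ungroup()

result
```

The result has one block of coefficient rows per Country and response, with Country and response as leading columns.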

Related

Different result using fixest and multiple fixed effects

First of all, I have to apologize if my headline is misleading. I am not sure how to put it appropriately for my question.
I am currently working on a fixed-effects model. My data looks like the following table, although it is not the actual data, for privacy reasons.
state district year grade Y   X   id
AK    1001     2009 3     0.1 0.5 1001.3
AK    1001     2010 3     0.8 0.4 1001.3
AK    1001     2011 3     0.5 0.7 1001.3
AK    1001     2009 4     1.5 1.3 1001.4
AK    1001     2010 4     1.1 0.7 1001.4
AK    1001     2011 4     2.1 0.4 1001.4
...   ...      ...  ...   ... ... ...
WY    5606     2011 6     4.2 5.3 5606.6
I used the fixest package to run the fixed-effect model for this project. To get the unique observation in this dataset, I have to combine district, grade, and year. Note that I avoided using plm because there is no way to specify three fixed effects in the model unless you combine two identities (in my case, I generated id by combining district and grade). fixest seems to be able to solve this problem. However, I got different results when specifying three fixed effects (district, grade, and year) compared to two fixed effects (id and year). The following results and codes may clear up some confusion from my explanation.
# Two fixed effects (id and year)
df <- transform(df, id = apply(df[c("district", "grade")], 1, paste, collapse = "."))
fe = feols(y ~ x | id + year, df, se = "standard")
summary(fe)
OLS estimation, Dep. Var.: y
Observations: 499,112
Fixed-effects: id: 64,302, year: 10
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
X 0.012672 0.003602 3.51804 0.00043478 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.589222 Adj. R2: 0.761891
Within R2: 2.846e-5
###########################################################################
# Three fixed effects (district, grade, and year)
fe = feols(y ~ x | district + grade + year, df, se = "standard")
summary(fe)
OLS estimation, Dep. Var.: y
Observations: 499,112
Fixed-effects: district: 11,097, grade: 6, year: 10
Standard-errors: IID
Estimate Std. Error t value Pr(>|t|)
X 0.014593 0.00401 3.63866 0.00027408 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.702543 Adj. R2: 0.698399
Within R2: 2.713e-5
Questions:
Why are the results different?
This is the equation I plan to use (not reproduced here). I am not sure which model corresponds to this specification. My feeling is that it is the second one. But if that is the case, why do many websites recommend combining two identifiers and running a normal plm?
Thank you so much for reading my question. Any answers/ suggestions/ advice would be appreciated!
The answer is simply that you are estimating two different models.
Three fixed-effects (FEs):
Year + id FEs (I renamed id to district_grade):
The first set of fixed effects is nested in the second: the district_grade effects span every additive combination of district and grade effects, so the three-FE model is the more restrictive of the two.
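As a sketch, the two specifications can be written as follows. The notation is an assumption made for illustration (d = district, g = grade, t = year), not the notation of the original post:

```latex
% Three additive fixed effects: district, grade, year
y_{dgt} = \alpha_d + \gamma_g + \lambda_t + \beta x_{dgt} + \varepsilon_{dgt}

% Two fixed effects: district-by-grade (id) and year
y_{dgt} = \mu_{dg} + \lambda_t + \beta x_{dgt} + \varepsilon_{dgt}
```

The first model is the special case of the second in which \mu_{dg} = \alpha_d + \gamma_g, which is why the two estimates of \beta generally differ.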
Here is a reproducible example in which we see that we obtain two different sets of coefficients.
data(trade)
est = fepois(Euros ~ log(dist_km) | sw(Origin + Product, Origin^Product) + Year, trade)
etable(est, vcov = "iid")
#> model 1 model 2
#> Dependent Var.: Euros Euros
#>
#> log(dist_km) -1.020*** (1.18e-6) -1.024*** (1.19e-6)
#> Fixed-Effects: ------------------- -------------------
#> Origin Yes No
#> Product Yes No
#> Year Yes Yes
#> Origin-Product No Yes
#> _______________ ___________________ ___________________
#> S.E. type IID IID
#> Observations 38,325 38,325
#> Squared Cor. 0.27817 0.35902
#> Pseudo R2 0.53802 0.64562
#> BIC 2.75e+12 2.11e+12
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that they have different FEs, which confirms that the models estimated are completely different:
summary(fixef(est$`Origin + Product + Year`))
#> Fixed_effects coefficients
#> Origin Product Year
#> Number of fixed-effects 15 20 10
#> Number of references 0 1 1
#> Mean 23.5 -0.012 0.157
#> Standard-deviation 1.15 1.35 0.113
#> COEFFICIENTS:
#> Origin: AT BE DE DK ES
#> 22.91 23.84 24.62 23.62 24.83 ... 10 remaining
#> -----
#> Product: 1 2 3 4 5
#> 0 1.381 0.624 1.414 -1.527 ... 15 remaining
#> -----
#> Year: 2007 2008 2009 2010 2011
#> 0 0.06986 0.006301 0.07463 0.163 ... 5 remaining
summary(fixef(est$`Origin^Product + Year`))
#> Fixed_effects coefficients
#> Origin^Product Year
#> Number of fixed-effects 300 10
#> Number of references 0 1
#> Mean 23.1 0.157
#> Standard-deviation 1.96 0.113
#> COEFFICIENTS:
#> Origin^Product: 101 102 103 104 105
#> 22.32 24.42 24.82 21.28 23.04 ... 295 remaining
#> -----
#> Year: 2007 2008 2009 2010 2011
#> 0 0.06962 0.006204 0.07454 0.1633 ... 5 remaining

Mapping broom::tidy to nested list of {fixest} models and keep name of list element

I want to apply broom::tidy() to models nested in a fixest_multi object and extract the names of each list level as data frame columns. Here's an example of what I mean.
library(fixest)
library(tidyverse)
library(broom)
multiple_est <- feols(c(Ozone, Solar.R) ~ Wind + Temp, airquality, fsplit = ~Month)
This command estimates two models (one per dependent variable, Ozone and Solar.R) for each Month subset plus the full sample. Here's what the resulting object looks like:
> names(multiple_est)
[1] "Full sample" "5" "6" "7" "8" "9"
> names(multiple_est$`Full sample`)
[1] "Ozone" "Solar.R"
I now want to tidy each model object, but keep the information of the Month / Dep.var. combination as columns in the tidied data frame.
I can run map_dfr from the purrr package, giving me this result:
> map_dfr(multiple_est, tidy, .id ="Month") %>% head(9)
# A tibble: 9 x 6
Month term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Full sample (Intercept) -71.0 23.6 -3.01 3.20e- 3
2 Full sample Wind -3.06 0.663 -4.61 1.08e- 5
3 Full sample Temp 1.84 0.250 7.36 3.15e-11
4 5 (Intercept) -76.4 82.0 -0.931 3.53e- 1
5 5 Wind 2.21 2.31 0.958 3.40e- 1
6 5 Temp 3.07 0.878 3.50 6.15e- 4
7 6 (Intercept) -70.6 46.8 -1.51 1.45e- 1
8 6 Wind -1.34 1.13 -1.18 2.50e- 1
9 6 Temp 1.64 0.609 2.70 1.29e- 2
But this tidies only the first model of each Month, the model with the Ozone outcome.
My desired output would look something like this:
Month outcome term estimate more columns from tidy
Full sample Ozone (Intercept) -71.0
Full sample Ozone Wind -3.06
Full sample Ozone Temp 1.84
Full sample Solar.R (Intercept) some value
Full sample Solar.R Wind some value
Full sample Solar.R Temp some value
... rows repeated for each month 5, 6, 7, 8, 9
How can I apply tidy to all models and add another column that indicates the outcome of the model (which is stored in the name of the model object)?
So, fixest_multi has a pretty strange setup, as I found when I delved deeper. As you noticed, mapping across it or using apply only accesses some of the models. In fact, you don't get only the models for "Ozone"; you get the first 6 models in the object (those for c("Full sample", "5", "6")).
If you convert to a list, it accesses the data attribute, which is a sequential list of all 12 models, but drops the names you're looking for. So, as a workaround, you could use pmap() with the names (found in the attributes of the object) to tidy() each model and then use mutate() to add your desired columns.
library(fixest)
library(tidyverse)
library(broom)
multiple_est <- feols(c(Ozone, Solar.R) ~ Wind + Temp, airquality, fsplit = ~Month)
nms <- attr(multiple_est, "meta")$all_names
pmap_dfr(
  list(
    data = as.list(multiple_est),
    month = rep(nms$sample, each = length(nms$lhs)),
    outcome = rep(nms$lhs, length(nms$sample))
  ),
  ~ tidy(..1) %>%
    mutate(
      Month = ..2,
      outcome = ..3,
      .before = 1
    )
)
#> # A tibble: 36 × 7
#> Month outcome term estimate std.error statistic p.value
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Full sample Ozone (Intercept) -71.0 23.6 -3.01 3.20e- 3
#> 2 Full sample Ozone Wind -3.06 0.663 -4.61 1.08e- 5
#> 3 Full sample Ozone Temp 1.84 0.250 7.36 3.15e-11
#> 4 Full sample Solar.R (Intercept) -76.4 82.0 -0.931 3.53e- 1
#> 5 Full sample Solar.R Wind 2.21 2.31 0.958 3.40e- 1
#> 6 Full sample Solar.R Temp 3.07 0.878 3.50 6.15e- 4
#> 7 5 Ozone (Intercept) -70.6 46.8 -1.51 1.45e- 1
#> 8 5 Ozone Wind -1.34 1.13 -1.18 2.50e- 1
#> 9 5 Ozone Temp 1.64 0.609 2.70 1.29e- 2
#> 10 5 Solar.R (Intercept) -284. 262. -1.08 2.89e- 1
#> # … with 26 more rows

How to calculate the Intraclass correlation (ICC) in R?

I have a dataset that is in a long format with 200 variables, 94 subjects, and each subject has anywhere from 1 to 3 measurements for each variable.
Eg:
ID measurement var1 var2 . . .
1 1 2 6
1 2 3 8
1 3 6 12
2 1 3 9
2 2 4 4
2 3 5 3
3 1 1 11
3 2 1 4
. . . .
. . . .
. . . .
However, some variables have missing values for one of three measurements. It was suggested to me that before imputing missing values with the mean for the subject, I should use a repeated measures ANOVA or mixed model in order to confirm the repeatability of measurements.
The first thing I found to calculate the ICC was the ICC() function from the psych package. However, from what I can tell this requires that the data have one row per subject and one column per measurement, which would be further complicated by the fact that I have 200 variables I need to calculate the ICC for individually. I did go ahead and calculate the ICC for a single variable, and obtained this output:
Intraclass correlation coefficients
type ICC F df1 df2 p lower bound upper bound
Single_raters_absolute ICC1 0.38 2.8 93 188 0.00000000067 0.27 0.49
Single_random_raters ICC2 0.38 2.8 93 186 0.00000000068 0.27 0.49
Single_fixed_raters ICC3 0.38 2.8 93 186 0.00000000068 0.27 0.49
Average_raters_absolute ICC1k 0.65 2.8 93 188 0.00000000067 0.53 0.74
Average_random_raters ICC2k 0.65 2.8 93 186 0.00000000068 0.53 0.74
Average_fixed_raters ICC3k 0.65 2.8 93 186 0.00000000068 0.53 0.74
Number of subjects = 94 Number of Judges = 3
Next, I tried to calculate the ICC using a mixed model. Using this code:
m1 <- lme(var1 ~ measurement, random=~1|ID, data=mydata, na.action=na.omit)
summary(m1)
The output looks like this:
Linear mixed-effects model fit by REML
Data: mydata
AIC BIC logLik
-1917.113 -1902.948 962.5564
Random effects:
Formula: ~1 | ORIGINAL_ID
(Intercept) Residual
StdDev: 0.003568426 0.004550419
Fixed effects: var1 ~ measurement
Value Std.Error DF t-value p-value
(Intercept) 0.003998953 0.0008388997 162 4.766902 0.0000
measurement 0.000473053 0.0003593452 162 1.316429 0.1899
Correlation:
(Intr)
measurement -0.83
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-3.35050264 -0.30417725 -0.03383329 0.25106803 12.15267443
Number of Observations: 257
Number of Groups: 94
Is this the correct model to use to assess ICC? It is not clear to me what the correlation (Intr) is measuring, and it is different from the ICC obtained using ICC().
This is my first time calculating and using intraclass correlation, so any help is appreciated!
Using a mock dataset...
set.seed(42)
n <- 6
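dat <- data.frame(id = rep(1:n, 2),
                  group = as.factor(rep(LETTERS[1:2], n/2)),
                  # one draw per row; rnorm(n) would be recycled and give
                  # each id two identical values (no within-id variance)
                  V1 = rnorm(n*2),
                  V2 = runif(n*2, min=0, max=100),
                  V3 = runif(n*2, min=0, max=100),
                  V4 = runif(n*2, min=0, max=100),
                  V5 = runif(n*2, min=0, max=100))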
Loading some libraries...
library(lme4)
library(purrr)
library(tidyr)
# Add list of variable names to the vector below...
var_list <- c("V1","V2","V3","V4","V5")
map_dfr() is from the purrr library. I use lme4::VarCorr() to get the variances at each level.
map_dfr(var_list,
        function(x) {
          # build the formula "<variable> ~ group + (1|id)" for this variable
          formula_mlm = as.formula(paste0(x, "~ group + (1|id)"))
          model_fit = lmer(formula_mlm, data = dat)
          # variance components (id and Residual) as a data frame
          re_variances = VarCorr(model_fit, comp = "Variance") %>%
            data.frame() %>%
            dplyr::mutate(variable = x)
          return(re_variances)
        }) %>%
  dplyr::select(variable, grp, vcov) %>%
  pivot_wider(names_from = "grp", values_from = "vcov") %>%
  # ICC = between-id variance / total variance
  dplyr::mutate(icc = id/(id + Residual))
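The same ratio can be checked by hand against the lme output in the question. This is only an arithmetic sketch using the two StdDev values printed under "Random effects" there; the variable names are made up. As far as I can tell, the "Correlation: (Intr)" block in that summary is the correlation between the fixed-effect estimates (intercept and measurement), not an ICC, which would explain why it does not match.

```r
# Standard deviations copied from the lme summary in the question
sd_id <- 0.003568426        # random intercept (between-subject) SD
sd_resid <- 0.004550419     # residual (within-subject) SD

# ICC = between-subject variance / (between-subject + residual variance)
icc <- sd_id^2 / (sd_id^2 + sd_resid^2)
round(icc, 2)
#> [1] 0.38
```

This agrees with the single-rater values (0.38) reported by psych::ICC() in the question.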

No P or F values in Two Way ANOVA on R

I'm doing an assignment for university and have copied and pasted the R code so I know it's right but I'm still not getting any P or F values from my data:
Food Temperature Area
50 11 820.2175
100 11 936.5437
50 14 1506.568
100 14 1288.053
50 17 1692.882
100 17 1792.54
This is the code I've used so far:
aovdata<-read.table("Condition by area.csv",sep=",",header=T)
attach(aovdata)
Food <- as.factor(Food) ; Temperature <- as.factor(Temperature)
summary(aov(Area ~ Temperature*Food))
but then this is the output:
Df Sum Sq Mean Sq
Temperature 2 757105 378552
Food 1 1 1
Temperature:Food 2 35605 17803
Any help, especially the code I need to fix it, would be great. I think there could be a problem with the data but I don't know what.
I would do this. Be aware of the difference between factor and continuous predictors.
library(tidyverse)
df <- sapply(strsplit(c("Food Temperature Area", "50 11 820.2175", "100 11 936.5437",
"50 14 1506.568", "100 14 1288.053", "50 17 1692.882",
"100 17 1792.54")," +"), paste0, collapse=",") %>%
read_csv()
model <- lm(Area ~ Temperature * as.factor(Food),df)
summary(model)
#>
#> Call:
#> lm(formula = Area ~ Temperature * as.factor(Food), data = df)
#>
#> Residuals:
#> 1 2 3 4 5 6
#> -83.34 25.50 166.68 -50.99 -83.34 25.50
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -696.328 505.683 -1.377 0.302
#> Temperature 145.444 35.580 4.088 0.055 .
#> as.factor(Food)100 38.049 715.144 0.053 0.962
#> Temperature:as.factor(Food)100 -2.778 50.317 -0.055 0.961
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 151 on 2 degrees of freedom
#> Multiple R-squared: 0.9425, Adjusted R-squared: 0.8563
#> F-statistic: 10.93 on 3 and 2 DF, p-value: 0.08498
ggeffects::ggpredict(model,terms = c('Temperature','Food')) %>% plot()
Created on 2020-12-08 by the reprex package (v0.3.0)
The actual problem with your example is not that you're using factors as predictor variables, but rather that you have fitted a 'saturated' linear model (as many parameters as observations), so there is no variation left to compute a residual SSQ, so the ANOVA doesn't include F/P values etc.
It's fine for temperature and food to be categorical (factor) predictors, that's how they would be treated in a classic two-way ANOVA design. It's just that in order to analyze this design with the interaction you need more replication.
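To see this with the six rows from the question, here is a small sketch that counts residual degrees of freedom. The data frame is typed in by hand from the question, and the additive model is only there for comparison; whether dropping the interaction is appropriate for your assignment is a separate matter.

```r
# The six observations from the question, both predictors as factors
aovdata <- data.frame(
  Food = factor(c(50, 100, 50, 100, 50, 100)),
  Temperature = factor(c(11, 11, 14, 14, 17, 17)),
  Area = c(820.2175, 936.5437, 1506.568, 1288.053, 1692.882, 1792.54)
)

# Full two-way model: 1 + 2 + 1 + 2 = 6 parameters for 6 observations
fit_full <- aov(Area ~ Temperature * Food, data = aovdata)
df.residual(fit_full)
#> [1] 0

# Without the interaction: 4 parameters, so 2 residual degrees of freedom
fit_additive <- aov(Area ~ Temperature + Food, data = aovdata)
df.residual(fit_additive)
#> [1] 2

# With residual df > 0 the summary table includes F and p values
summary(fit_additive)
```

With replicate measurements in each Food by Temperature cell, the full model with the interaction would have residual degrees of freedom as well.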

Multiple LM model returning the same coefficients

Hello Stack Community,
I am trying to model wage growth across US territories using linear models to forecast into the future. I want to try and create a model for each state/ territory (DC, VI, and PR), however, when I look at the coefficients for my models, they are the same for each state.
I have used a combination of plyr, dplyr, and broom thus far to create and sort my data frame (named stuben_dat) for this project.
#Wage Growth
state_data = stuben_dat %>% group_by(st) %>%
do (state_wg= lm(wage_growth ~ us_wage_growth + lag_wage_growth + dum1
+dum2 +dum3,
data= stuben_dat, subset=yr>= (current_year - 5)))
#The dummy variables adjust for seasonality (q1 vs q2 vs q3 vs q4)
#The current_year = whatever year I last updated the program
#The current_year-5 value lets me change the look back period
#This look back period can be used to exclude recessions or outliers
Here is just a snapshot of my output. As you can see, the beta coefficients and regression statistics are exactly the same for each state (just AK and AL are shown here). However, I want to build a different model for each state.
# A tibble: 318 x 6
# Groups: st [53]
st term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 AK (Intercept) -1.75 0.294 -5.97 3.28e- 9
2 AK us_wage_growth 996. 23.6 42.2 1.82e-228
3 AK lag_wage_growth 0.191 0.0205 9.34 5.58e- 20
4 AK dum1 -0.245 0.304 -0.806 4.21e- 1
5 AK dum2 -0.321 0.304 -1.06 2.90e- 1
6 AK dum3 0.0947 0.303 0.312 7.55e- 1
7 AL (Intercept) -1.75 0.294 -5.97 3.28e- 9
8 AL us_wage_growth 996. 23.6 42.2 1.82e-228
9 AL lag_wage_growth 0.191 0.0205 9.34 5.58e- 20
10 AL dum1 -0.245 0.304 -0.806 4.21e- 1
# ... with 308 more rows
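It is because you pass the full data frame (data = stuben_dat) to lm() inside do(), so every group is fitted on the same rows. Use . to refer to the current group's data instead: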
state_data = stuben_dat %>%
group_by(st) %>%
do(state_wg = lm(wage_growth ~ us_wage_growth + lag_wage_growth +
dum1 + dum2 + dum3,
data = ., subset = (yr >= (current_year - 5))))
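The difference is easy to reproduce on a built-in dataset. The sketch below uses mtcars grouped by cyl purely as a stand-in for stuben_dat grouped by st, and the object names are made up.

```r
library(dplyr)
library(broom)

# data = . : each group is fitted on its own rows
per_group <- mtcars %>%
  group_by(cyl) %>%
  do(tidy(lm(mpg ~ wt, data = .)))

# data = mtcars : every group is fitted on the full data frame
all_rows <- mtcars %>%
  group_by(cyl) %>%
  do(tidy(lm(mpg ~ wt, data = mtcars)))

# Slopes differ by group in the first case and are identical in the second
per_group$estimate[per_group$term == "wt"]
all_rows$estimate[all_rows$term == "wt"]
```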
