Robust regression p values with lmrob

Robust regression p values with lmrob - r

I am trying to do robust multiple regression for a dataset where a few outliers don't allow me to see the underlying patterns through the usual linear models.
I am using the function lmrob in the package robustbase, and I was surprised by the number of significant relationships that I found. I decided to try the method with random data, this is the code:
library(robustbase)
set.seed(4)
ax<-data.frame(a1=rnorm(20,3),
a2=rnorm(20,5),
a3=rnorm(20,4),
a4=rnorm(20,6),
a5=rnorm(20,2))
axm<-lmrob(a1~a2*a3*a4*a5,data=ax)
summary(axm)
And the output:
Call:
lmrob(formula = a1 ~ a2 * a3 * a4 * a5, data = ax)
\--> method = "MM"
Residuals:
1 2 3 4 5 6 7 8 9 10 11 12 13
-34.740270 -0.049493 -0.044379 0.002770 0.219825 0.041285 0.156152 -0.072825 0.034824 -0.014757 -0.088263 -0.185045 -0.079679
14 15 16 17 18 19 20
-0.045121 -0.007576 0.008813 0.010451 0.015716 0.060781 0.040187
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1160.5907 94.0095 -12.35 0.000247 ***
a2 205.6910 15.8689 12.96 0.000204 ***
a3 327.9787 24.2161 13.54 0.000172 ***
a4 193.2384 15.7300 12.29 0.000252 ***
a5 734.2203 49.8960 14.71 0.000124 ***
a2:a3 -57.6229 4.0533 -14.22 0.000142 ***
a2:a4 -33.5644 2.6130 -12.85 0.000212 ***
a3:a4 -54.1622 4.0438 -13.39 0.000180 ***
a2:a5 -138.8395 9.2697 -14.98 0.000116 ***
a3:a5 -198.4961 12.3168 -16.12 8.67e-05 ***
a4:a5 -123.0895 8.2792 -14.87 0.000119 ***
a2:a3:a4 9.3344 0.6659 14.02 0.000150 ***
a2:a3:a5 37.1371 2.2502 16.50 7.89e-05 ***
a2:a4:a5 23.0014 1.5152 15.18 0.000110 ***
a3:a4:a5 32.9766 2.0388 16.18 8.55e-05 ***
a2:a3:a4:a5 -6.0817 0.3660 -16.62 7.68e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Robust residual standard error: 0.4039
Multiple R-squared: 0.9861, Adjusted R-squared: 0.934
Convergence in 5 IRWLS iterations
Robustness weights:
observation 1 is an outlier with |weight| = 0 ( < 0.005);
9 weights are ~= 1. The remaining 10 ones are
2 3 5 7 8 11 12 13 14 19
0.9986 0.9989 0.9732 0.9864 0.9970 0.9957 0.9810 0.9965 0.9989 0.9979
Algorithmic parameters:
tuning.chi bb tuning.psi refine.tol rel.tol scale.tol solve.tol eps.outlier
1.548e+00 5.000e-01 4.685e+00 1.000e-07 1.000e-07 1.000e-10 1.000e-07 5.000e-03
eps.x warn.limit.reject warn.limit.meanrw
1.150e-09 5.000e-01 5.000e-01
nResample max.it best.r.s k.fast.s k.max maxit.scale trace.lev mts compute.rd
500 50 2 1 200 200 0 1000 0
fast.s.large.n
2000
psi subsampling cov compute.outlier.stats
"bisquare" "nonsingular" ".vcov.avar1" "SM"
seed : int(0)
According to this, I understand that the other random variables are related to the first one, and have high predictive power over it, which makes no sense.
What is happening here? I am doing the regression wrong?
Edit: I put a seed for which the p values are extremely low for replicability.

I think I may have found the explanation to such high p-values: turns out that the MM estimate with small sample sizes (as in my example with 20) is prone to type 1 error. One of the authors of the robustbase package has published an article proposing an alternative estimate for this cases, but I'm afraid that for my data it doesn't work so much better.

Related

Plotting mean and standard error of mean from linear regression

I've run a multiple linear regression where pred_acc is the dependent continuous variable and emotion_pred and emotion_target are two dummy coded independent variables with 0 and 1. Furthermore I am interested in the interaction between the two independent variables.
model <- lm(predic_acc ~ emotion_pred * emotion_target, data = data_almost_final)
summary(model)
Residuals:
Min 1Q Median 3Q Max
-0.66049 -0.19522 0.01235 0.19213 0.67284
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.97222 0.06737 14.432 < 2e-16 ***
emotion_pred 0.45988 0.09527 4.827 8.19e-06 ***
emotion_target 0.24383 0.09527 2.559 0.012719 *
emotion_pred:emotion_target -0.47840 0.13474 -3.551 0.000703 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2858 on 68 degrees of freedom
(1224 Beobachtungen als fehlend gelöscht)
Multiple R-squared: 0.2555, Adjusted R-squared: 0.2227
F-statistic: 7.781 on 3 and 68 DF, p-value: 0.0001536
In case some context is needed: I did a survey where couples had to predict their partners preferences. The predictor individual was either in emotion state 0 or 1 (emotion_pred) and the target individual was either in emotion state 0 or 1 (emotion_target). Accordingly, there are four combinations.
Now I want to plot the regression with the means of each combination of the independent variables (0,1; 1,0; 1,1; 0,0) and add an error bar with the standard error of the means. I have literally no idea at all how to do this. Anyone can help me with this?
Here's an extraction from my data:
pred_acc emotion_pred emotion_target
1 1.0000000 1 0
2 1.2222222 0 1
3 0.7777778 0 0
4 1.1111111 1 1
5 1.3888889 1 1
Sketch of how I want it to look like

Using emmip from the emmeans library:
model <- lm(data=d2, pred_acc ~ emotion_pred*emotion_target)
emmip(model, emotion_pred ~ emotion_target, CIs = TRUE, style = "factor")
If you want more control over the image or just to get the values you can use the emmeans function directly:
> emmeans(model , ~ emotion_pred * emotion_target )
emotion_pred emotion_target emmean SE df lower.CL upper.CL
0 0 0.778 0.196 1 -1.718 3.27
1 0 1.000 0.196 1 -1.496 3.50
0 1 1.222 0.196 1 -1.274 3.72
1 1 1.250 0.139 1 -0.515 3.01
Then you can use ggplot on this dataframe to make whatever graph you like.

How to use ANOVA (function = aov) for a fractional factorial designed experiment with 6 independent variables in R

I have done an incomplete factorial design (fractional factorial design) experiment on different fertilizer applications.
Here is the format of the data: Excerpt of data
I want to do an ANOVA in R using the function aov. I have 450 data points in total, 'Location' has 5 factors, N has 3, and F1,F2,F3+F4 have two each.
Here is the code that I am using:
ANOVA1<-aov(PlantWeight~Location*N*F1*F2*F3*F4, data = data)
summary(ANOVA1)
Independent variables F1,F2,F3+F4 are not applied in a factorial manner. Each sample either has F1,F2,F3+F4 or nothing applied. In the cases where no F1,F2,F3 or F4 fertiliser was applied the value 0 has been put in every column. This is the control to which each of F1,F2,F3+F4 will be compared to. If F1 has been applied then column F1 will read 1 and it will read NA in F2,F3+F4 columns.
When I try to run this ANOVA I get this error message:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
Another approach I had was to but an 'x' instead of 'NA'. This has issues because it is assuming that x is a factor when it is not. It seemed to work fine except it would always ignore the F4.
ANOVA2<-aov(PlantWeight~((F1*Location*N)+(F2*Location*N)+(F3*Location*N)+
(F4*Location*N)), data = data)
summary(ANOVA2)
Results:
Df Sum Sq Mean Sq F value Pr(>F)
F1 2 10.3 5.13 5.742 0.00351 **
Location 6 798.6 133.11 149.027 < 2e-16 ***
N 2 579.6 289.82 324.485 < 2e-16 ***
F2 1 0.3 0.33 0.364 0.54667
F3 1 0.4 0.44 0.489 0.48466
F1:Location 10 26.5 2.65 2.962 0.00135 **
F1:N 4 6.6 1.66 1.857 0.11737
Location:N 10 113.5 11.35 12.707 < 2e-16 ***
Location:F2 5 6.5 1.30 1.461 0.20188
N:F2 2 2.7 1.37 1.537 0.21641
Location:F3 5 33.6 6.72 7.529 9.73e-07 ***
N:F3 2 2.5 1.23 1.375 0.25409
F1:Location:N 20 12.4 0.62 0.696 0.83029
F2:Location:N 10 18.9 1.89 2.113 0.02284 *
F3:Location:N 10 26.8 2.68 3.001 0.00118 **
Residuals 359 320.6 0.89
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Any help on how to approach this would be wonderful!

Is there a way to replicate the functions from panelAR with plm or another package?

I'm trying to replicate and expand on a study from 2011 that just so happens to be used as one of the demo
examples in the "panelAR" package. I don't know exactly how or why, but the demo code produces the exact same regression results from one section of the original study. One of the authors posted their replication code on their website, but it's in Stata, so I can't follow along in the "panelAR" demo code to understand how it accomplishes the same thing as the Stata code.
Here are the links to the original STATA code, article, and data.
I've been able to successfully use the "panelAR" code to run regressions with my new data, but sadly "panelAR" objects are not compatible with "stargazer", which is the package I use to make my formatted tables.
All that said, is there a way to replicate the the following code using a different panel data package or combination of packages? I've tried using "plm", "pcse", and "nmle" but with no luck.
Below is the R code that runs the first regression model:
data(LupPon)
tibble(LupPon)
# A tibble: 858 x 14
country id year redist ratio9050 ratio5010 ratio9010 skew turnout fempar propind pvoc union unempl
<chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Australia 1 1960 NA NA NA NA NA NA NA NA NA NA NA
2 Australia 1 1961 NA NA NA NA NA NA NA NA NA NA NA
3 Australia 1 1962 NA NA NA NA NA NA NA NA NA NA NA
4 Australia 1 1963 NA NA NA NA NA NA NA NA NA NA NA
5 Australia 1 1964 NA NA NA NA NA NA NA NA NA NA NA
6 Australia 1 1965 NA NA NA NA NA NA NA NA NA NA NA
7 Australia 1 1966 NA NA NA NA NA NA NA NA NA NA NA
8 Australia 1 1967 NA NA NA NA NA NA NA NA NA NA NA
9 Australia 1 1968 NA NA NA NA NA NA NA NA NA NA NA
10 Australia 1 1969 NA NA NA NA NA NA NA NA NA NA NA
# … with 848 more rows
LupPon <- LupPon[!is.na(LupPon$redist),]
LupPon$redist.lag <- unlist(by(LupPon,LupPon$id,function(x){c(NA,x[,"redist"]
[1:(length(x[,"redist"])-1)])}))
LupPon$time <- unlist(by(LupPon,LupPon$id,function(x) seq(1:nrow(x))))
out1 <- panelAR(redist ~ redist.lag + ratio9050 + ratio5010 + turnout + fempar + propind +
pvoc + union + unempl, data=LupPon, panelVar='id', timeVar='time', autoCorr='ar1',
panelCorrMethod='pcse',rho.na.rm=TRUE, panel.weight='t-1', bound.rho=TRUE)
summary(out1
Panel Regression with AR(1) Prais-Winsten correction and panel-corrected standard errors
Unbalanced Panel Design:
Total obs.: 68 Avg obs. per panel 4.5333
Number of panels: 15 Max obs. per panel 9
Number of times: 9 Min obs. per panel 1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.26666 11.15944 -0.293 0.770776
redist.lag 0.50658 0.12652 4.004 0.000179 ***
ratio9050 3.81044 3.35976 1.134 0.261402
ratio5010 -4.76833 2.06327 -2.311 0.024405 *
turnout 0.09781 0.03644 2.684 0.009454 **
fempar 0.09134 0.05464 1.672 0.099973 .
propind 0.07253 2.54464 0.029 0.977360
pvoc 0.01860 0.03668 0.507 0.613909
union 0.08862 0.03736 2.372 0.021029 *
unempl 0.12415 0.13443 0.923 0.359580
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-squared: 0.8886
Wald statistic: -708.2314, Pr(>Chisq(9)): 1
Edit 2: And here's the STATA code that also runs the first regression model:
** Redistribution models (Table 2)
preserve
keep if redist!=.
sort id year
by id: egen order = seq()
tsset id order
xtpcse redist l1.redist dvpratio9050 dvpratio5010 dvturnout dvfempar dvstddisp_gall dvpvoc dvunion dvunempl, pairwise cor(ar1)
predict pred if e(sample), xb
gen resid = redist-pred
egen stresid=std(resid)
gen outlier = 0 if e(sample)
replace outlier = 1 if abs(stresid)>1.5
Edit 3: Below are the all of the code chunks for the next 7 regression models in both R and STATA
#### Regressions in R
# Removing outliers...
mod1.resid <- out1$residuals
index <- which(abs((mod1.resid-mean(mod1.resid))/sd(mod1.resid)) <= 1.5)
LupPon.nooutlier <- out1$model[index,]> out2 <- panelAR(redist ~ redist.lag + ratio9050 + ratio5010 + turnout + fempar + propind + pvoc + union + unempl, data=LupPon.nooutlier, panelVar='id', timeVar='time', autoCorr='ar1', panelCorrMethod='pcse',rho.na.rm=TRUE, panel.weight='t-1', bound.rho=TRUE)
The following units have non-consecutive observations. Use runs.analysis() on output for additional details: 12, 15, 16, 17, 4, 6.
Panel-specific correlations bounded to [-1,1]
summary(out2)
Panel Regression with AR(1) Prais-Winsten correction and panel-corrected standard errors
Unbalanced Panel Design:
Total obs.: 58 Avg obs. per panel 3.8667
Number of panels: 15 Max obs. per panel 8
Number of times: 9 Min obs. per panel 1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.57080 7.27261 0.078 0.9378
redist.lag 0.49404 0.07800 6.333 7.74e-08 ***
ratio9050 6.04188 2.81801 2.144 0.0371 *
ratio5010 -6.58628 1.32426 -4.974 8.82e-06 ***
turnout 0.06427 0.02554 2.516 0.0153 *
fempar 0.07852 0.03606 2.178 0.0344 *
propind -2.46670 2.05462 -1.201 0.2358
pvoc 0.01582 0.02327 0.680 0.4999
union 0.12558 0.01634 7.686 6.59e-10 ***
unempl 0.04132 0.10911 0.379 0.7066
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-squared: 0.931
Wald statistic: 2323.5873, Pr(>Chisq(9)): 0
out3 <- panelAR(redist ~ ratio9050 + ratio5010 + as.factor(id), data=LupPon, panelVar='id', timeVar='time', autoCorr='ar1', panelCorrMethod='pcse',rho.na.rm=TRUE, panel.weight='t-1', bound.rho=TRUE)
Panel-specific correlations bounded to [-1,1]
summary(out3)
Panel Regression with AR(1) Prais-Winsten correction and panel-corrected standard errors
Unbalanced Panel Design:
Total obs.: 77 Avg obs. per panel 5.1333
Number of panels: 15 Max obs. per panel 10
Number of times: 10 Min obs. per panel 1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.97114 9.87777 1.414 0.162412
ratio9050 14.05312 4.99515 2.813 0.006619 **
ratio5010 -8.13602 4.90541 -1.659 0.102420
as.factor(id)3 13.18767 1.50276 8.776 2.36e-12 ***
as.factor(id)4 -0.69241 2.81607 -0.246 0.806616
as.factor(id)5 11.97750 2.20502 5.432 1.07e-06 ***
as.factor(id)6 10.30933 1.94688 5.295 1.78e-06 ***
as.factor(id)7 -2.09143 1.43608 -1.456 0.150511
as.factor(id)8 -1.66623 1.03527 -1.609 0.112766
as.factor(id)9 -0.07301 2.12339 -0.034 0.972686
as.factor(id)12 6.05386 1.73534 3.489 0.000916 ***
as.factor(id)14 8.45693 1.95346 4.329 5.77e-05 ***
as.factor(id)15 13.59385 2.24826 6.046 1.03e-07 ***
as.factor(id)16 -12.92293 1.34996 -9.573 1.09e-13 ***
as.factor(id)17 -2.62601 1.37326 -1.912 0.060623 .
as.factor(id)18 -9.95612 2.26996 -4.386 4.74e-05 ***
as.factor(id)20 -13.69930 2.20810 -6.204 5.59e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-squared: 0.8806
Wald statistic: 189246.9165, Pr(>Chisq(16)): 0
# Removing outliers...
mod3.resid <- out3$residuals
index <- which(abs((mod3.resid-mean(mod3.resid))/sd(mod3.resid)) <= 1.5)
LupPon.nooutlier <- out3$model[index,]> out4 <- panelAR(redist ~ ratio9050 + ratio5010 + as.factor(id), data=LupPon.nooutlier, panelVar='id', timeVar='time', autoCorr='ar1', panelCorrMethod='pcse',rho.na.rm=TRUE, panel.weight='t-1', bound.rho=TRUE)
The following units have non-consecutive observations. Use runs.analysis() on output for additional details: 12, 15, 17, 4, 5, 6.
Panel-specific correlations bounded to [-1,1]
summary(out4)
Panel Regression with AR(1) Prais-Winsten correction and panel-corrected standard errors
Unbalanced Panel Design:
Total obs.: 68 Avg obs. per panel 4.5333
Number of panels: 15 Max obs. per panel 8
Number of times: 10 Min obs. per panel 1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 20.6537 6.0314 3.424 0.00122 **
ratio9050 9.8536 3.2130 3.067 0.00346 **
ratio5010 -7.7280 4.0575 -1.905 0.06248 .
as.factor(id)12 6.8209 1.5622 4.366 6.19e-05 ***
as.factor(id)14 7.1422 1.6633 4.294 7.87e-05 ***
as.factor(id)15 11.7269 1.3660 8.585 1.79e-11 ***
as.factor(id)16 -13.1042 1.3083 -10.016 1.22e-13 ***
as.factor(id)17 -2.3581 1.0988 -2.146 0.03664 *
as.factor(id)18 -8.6729 1.5719 -5.518 1.16e-06 ***
as.factor(id)20 -12.3829 1.4979 -8.267 5.57e-11 ***
as.factor(id)3 12.6117 1.1372 11.091 3.37e-15 ***
as.factor(id)4 -1.8655 2.2007 -0.848 0.40057
as.factor(id)5 12.7513 0.8727 14.612 < 2e-16 ***
as.factor(id)6 8.6724 0.8584 10.102 9.10e-14 ***
as.factor(id)7 -1.1486 1.0426 -1.102 0.27575
as.factor(id)8 -1.7659 1.0488 -1.684 0.09833 .
as.factor(id)9 0.6549 1.6795 0.390 0.69822
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-squared: 0.9517
Wald statistic: 5785762.7106, Pr(>Chisq(16)): 0
out5 <- panelAR(redist ~ redist.lag + ratio9010 + skew + turnout + fempar + propind + pvoc + union + unempl, data=LupPon, panelVar='id', timeVar='time', autoCorr='ar1', panelCorrMethod='pcse',rho.na.rm=TRUE, panel.weight='t-1', bound.rho=TRUE)
Panel-specific correlations bounded to [-1,1]
summary(out5)
Panel Regression with AR(1) Prais-Winsten correction and panel-corrected standard errors
Unbalanced Panel Design:
Total obs.: 68 Avg obs. per panel 4.5333
Number of panels: 15 Max obs. per panel 9
Number of times: 9 Min obs. per panel 1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -14.73371 9.19697 -1.602 0.114585
redist.lag 0.49211 0.12412 3.965 0.000204 ***
ratio9010 -0.01548 1.13592 -0.014 0.989172
skew 10.17135 3.67271 2.769 0.007529 **
turnout 0.10182 0.03629 2.806 0.006819 **
fempar 0.08536 0.05333 1.601 0.114901
propind -0.06816 2.45060 -0.028 0.977905
pvoc 0.01991 0.03702 0.538 0.592875
union 0.09013 0.03607 2.499 0.015316 *
unempl 0.11177 0.13563 0.824 0.413280
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-squared: 0.8918
Wald statistic: -151.0228, Pr(>Chisq(9)): 1
# Removing outliers...
mod5.resid <- out5$residuals
index <- which(abs((mod5.resid-mean(mod5.resid))/sd(mod5.resid)) <= 1.5)
LupPon.nooutlier <- out5$model[index,]> out6 <- panelAR(redist ~ redist.lag + ratio9010 + skew + turnout + fempar + propind + pvoc + union + unempl, data=LupPon.nooutlier, panelVar='id', timeVar='time', autoCorr='ar1', panelCorrMethod='pcse',rho.na.rm=TRUE, panel.weight='t-1', bound.rho=TRUE)
The following units have non-consecutive observations. Use runs.analysis() on output for additional details: 12, 15, 16, 17, 4, 6.
Panel-specific correlations bounded to [-1,1]
summary(out6)
Panel Regression with AR(1) Prais-Winsten correction and panel-corrected standard errors
Unbalanced Panel Design:
Total obs.: 58 Avg obs. per panel 3.8667
Number of panels: 15 Max obs. per panel 8
Number of times: 9 Min obs. per panel 1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -12.43089 6.18074 -2.011 0.0499 *
redist.lag 0.48096 0.07362 6.533 3.83e-08 ***
ratio9010 -0.16200 0.94572 -0.171 0.8647
skew 12.98571 2.58573 5.022 7.48e-06 ***
turnout 0.06363 0.02581 2.466 0.0173 *
fempar 0.07440 0.03485 2.135 0.0379 *
propind -2.37649 1.93445 -1.229 0.2252
pvoc 0.01183 0.02326 0.509 0.6134
union 0.12312 0.01525 8.073 1.71e-10 ***
unempl 0.05119 0.10653 0.480 0.6331
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-squared: 0.9346
Wald statistic: 2147.9426, Pr(>Chisq(9)): 0
out7 <- panelAR(redist ~ ratio9010 + skew + as.factor(id), data=LupPon, panelVar='id', timeVar='time', autoCorr='ar1', panelCorrMethod='pcse',rho.na.rm=TRUE, panel.weight='t-1', bound.rho=TRUE)
Panel-specific correlations bounded to [-1,1]
summary(out7)
Panel Regression with AR(1) Prais-Winsten correction and panel-corrected standard errors
Unbalanced Panel Design:
Total obs.: 77 Avg obs. per panel 5.1333
Number of panels: 15 Max obs. per panel 10
Number of times: 10 Min obs. per panel 1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.6646 8.6392 -0.540 0.591242
ratio9010 1.3439 1.5360 0.875 0.385089
skew 24.4739 7.5166 3.256 0.001860 **
as.factor(id)3 12.3092 1.3360 9.214 4.32e-13 ***
as.factor(id)4 -0.0509 2.7927 -0.018 0.985518
as.factor(id)5 11.0080 2.1338 5.159 2.95e-06 ***
as.factor(id)6 9.0069 1.9432 4.635 1.97e-05 ***
as.factor(id)7 -2.6626 1.2938 -2.058 0.043947 *
as.factor(id)8 -1.6262 0.9011 -1.805 0.076137 .
as.factor(id)9 0.6049 2.1973 0.275 0.784038
as.factor(id)12 5.9046 1.6921 3.490 0.000913 ***
as.factor(id)14 7.9706 1.7490 4.557 2.60e-05 ***
as.factor(id)15 11.9357 2.3695 5.037 4.62e-06 ***
as.factor(id)16 -12.8997 1.5345 -8.406 9.96e-12 ***
as.factor(id)17 -2.1192 1.3775 -1.538 0.129196
as.factor(id)18 -9.3785 2.2897 -4.096 0.000128 ***
as.factor(id)20 -13.1480 2.2069 -5.958 1.45e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-squared: 0.8874
Wald statistic: 63452.5759, Pr(>Chisq(16)): 0
# Removing outliers...
mod7.resid <- out7$residuals
index <- which(abs((mod7.resid-mean(mod7.resid))/sd(mod7.resid)) <= 1.5)
LupPon.nooutlier <- out7$model[index,]> out8 <- panelAR(redist ~ ratio9010 + skew + as.factor(id), data=LupPon.nooutlier, panelVar='id', timeVar='time', autoCorr='ar1', panelCorrMethod='pcse',rho.na.rm=TRUE, panel.weight='t-1', bound.rho=TRUE)
The following units have non-consecutive observations. Use runs.analysis() on output for additional details: 12, 15, 17, 4, 5, 6.
Panel-specific correlations bounded to [-1,1]
summary(out8)
Panel Regression with AR(1) Prais-Winsten correction and panel-corrected standard errors
Unbalanced Panel Design:
Total obs.: 67 Avg obs. per panel 4.4667
Number of panels: 15 Max obs. per panel 8
Number of times: 10 Min obs. per panel 1
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.8009 5.5166 0.689 0.49402
ratio9010 -1.5372 0.9529 -1.613 0.11298
skew 24.4161 5.8523 4.172 0.00012 ***
as.factor(id)12 6.1617 1.4439 4.267 8.80e-05 ***
as.factor(id)14 5.0717 1.4991 3.383 0.00140 **
as.factor(id)15 8.3799 1.3962 6.002 2.17e-07 ***
as.factor(id)16 -14.9084 1.2974 -11.491 1.22e-15 ***
as.factor(id)17 -0.7629 1.0720 -0.712 0.47999
as.factor(id)18 -5.5874 1.5338 -3.643 0.00064 ***
as.factor(id)20 -9.3915 1.5103 -6.218 1.00e-07 ***
as.factor(id)3 10.5130 1.0326 10.181 8.77e-14 ***
as.factor(id)4 1.4597 2.0448 0.714 0.47862
as.factor(id)5 10.6512 0.8865 12.015 2.35e-16 ***
as.factor(id)6 6.2242 1.0314 6.035 1.93e-07 ***
as.factor(id)7 -1.5296 0.7421 -2.061 0.04451 *
as.factor(id)8 -1.7578 0.7908 -2.223 0.03079 *
as.factor(id)9 3.7324 1.7773 2.100 0.04079 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-squared: 0.9683
Wald statistic: 1080793.3476, Pr(>Chisq(16)): 0
# Regressions in STATA
xtpcse redist l1.redist dvpratio9050 dvpratio5010 dvturnout dvfempar dvstddisp_gall dvpvoc dvunion dvunempl if outlier!=1, pairwise cor(ar1) hetonly
drop pred resid stresid outlier
xi: xtpcse redist dvpratio9050 dvpratio5010 i.id, pairwise cor(ar1)
predict pred if e(sample), xb
gen resid = redist -pred
egen stresid=std(resid)
gen outlier = 0 if e(sample)
replace outlier = 1 if abs(stresid)>1.5
xi: xtpcse redist dvpratio9050 dvpratio5010 i.id if outlier!=1, pairwise cor(ar1)
drop pred resid stresid outlier
xtpcse redist l1.redist dvratio9010 dvskew dvturnout dvfempar dvstddisp_gall dvpvoc dvunion dvunempl, pairwise cor(ar1)
predict pred if e(sample), xb
gen resid = redist -pred
egen stresid=std(resid)
gen outlier = 0 if e(sample)
replace outlier = 1 if abs(stresid)>1.5
xtpcse redist l1.redist dvratio9010 dvskew dvturnout dvfempar dvstddisp_gall dvpvoc dvunion dvunempl if outlier!=1, pairwise cor(ar1)
drop pred resid stresid outlier
xi: xtpcse redist dvratio9010 dvskew i.id, pairwise cor(ar1)
predict pred if e(sample), xb
gen resid = redist -pred
egen stresid=std(resid)
gen outlier = 0 if e(sample)
replace outlier = 1 if abs(stresid)>1.5
xi: xtpcse redist dvratio9010 dvskew i.id if outlier!=1, pairwise cor(ar1)
drop pred resid stresid outlier
restore

If your main concern is to export your formatted tables, then it is much easier to replace stargazer with texreg instead of replicating the results using other packages. From my experience, texreg package supports panelAR objects. For example: screenreg(list(panelAR_model1, panelAR_model2)) will show a formatted table in the console, so you can check the output, while texreg(list(panelAR_model1, panelAR_model2)) will produce a latex table. See the texreg documentation for customization.

How do I use the glm() function?

I'm trying to fit a general linear model (GLM) on my data using R. I have a Y continuous variable and two categorical factors, A and B. Each factor is coded as 0 or 1, for presence or absence.
Even if just looking at the data I see a clear interaction between A and B, the GLM says that p-value>>>0.05. Am I doing something wrong?
First of all I create the data frame including my data for the GLM, which consists on a Y dependent variable and two factors, A and B. These are two level factors (0 and 1). There are 3 replicates per combination.
A<-c(0,0,0,1,1,1,0,0,0,1,1,1)
B<-c(0,0,0,0,0,0,1,1,1,1,1,1)
Y<-c(0.90,0.87,0.93,0.85,0.98,0.96,0.56,0.58,0.59,0.02,0.03,0.04)
my_data<-data.frame(A,B,Y)
Let’s see how it looks like:
my_data
## A B Y
## 1 0 0 0.90
## 2 0 0 0.87
## 3 0 0 0.93
## 4 1 0 0.85
## 5 1 0 0.98
## 6 1 0 0.96
## 7 0 1 0.56
## 8 0 1 0.58
## 9 0 1 0.59
## 10 1 1 0.02
## 11 1 1 0.03
## 12 1 1 0.04
As we can see just looking on the data, there is a clear interaction between factor A and factor B, as the value of Y dramatically decreases when A and B are present (that is A=1 and B=1). However, using the glm function I get no significant interaction between A and B, as p-value>>>0.05
attach(my_data)
## The following objects are masked _by_ .GlobalEnv:
##
## A, B, Y
my_glm<-glm(Y~A+B+A*B,data=my_data,family=binomial)
## Warning: non-integer #successes in a binomial glm!
summary(my_glm)
##
## Call:
## glm(formula = Y ~ A + B + A * B, family = binomial, data = my_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.275191 -0.040838 0.003374 0.068165 0.229196
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.1972 1.9245 1.142 0.254
## A 0.3895 2.9705 0.131 0.896
## B -1.8881 2.2515 -0.839 0.402
## A:B -4.1747 4.6523 -0.897 0.370
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7.86365 on 11 degrees of freedom
## Residual deviance: 0.17364 on 8 degrees of freedom
## AIC: 12.553
##
## Number of Fisher Scoring iterations: 6

While you state Y is continuous, the data shows that Y is rather a fraction. Hence, probably the reason you tried to apply GLM in the first place.
To model fractions (i.e. continuous values bounded by 0 and 1) can be done with logistic regression if certain assumptions are fullfilled. See the following cross-validated post for details: https://stats.stackexchange.com/questions/26762/how-to-do-logistic-regression-in-r-when-outcome-is-fractional. However, from the data description it is not clear that those assumptions are fullfilled.
An alternative to model fractions are beta regression or fractional repsonse models.
See below how to apply those methods to your data. The results of both methods are consistent in terms of signs and significance.
# Beta regression
install.packages("betareg")
library("betareg")
result.betareg <-betareg(Y~A+B+A*B,data=my_data)
summary(result.betareg)
# Call:
# betareg(formula = Y ~ A + B + A * B, data = my_data)
#
# Standardized weighted residuals 2:
# Min 1Q Median 3Q Max
# -2.7073 -0.4227 0.0682 0.5574 2.1586
#
# Coefficients (mean model with logit link):
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.1666 0.2192 9.885 < 2e-16 ***
# A 0.6471 0.3541 1.828 0.0676 .
# B -1.8617 0.2583 -7.206 5.76e-13 ***
# A:B -4.2632 0.5156 -8.268 < 2e-16 ***
#
# Phi coefficients (precision model with identity link):
# Estimate Std. Error z value Pr(>|z|)
# (phi) 71.57 29.50 2.426 0.0153 *
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Type of estimator: ML (maximum likelihood)
# Log-likelihood: 24.56 on 5 Df
# Pseudo R-squared: 0.9626
# Number of iterations: 62 (BFGS) + 2 (Fisher scoring)
# ----------------------------------------------------------
# Fractional response model
install.packages("frm")
library("frm")
frm(Y,cbind(A, B, AB=A*B),linkfrac="logit")
*** Fractional logit regression model ***
# Estimate Std. Error t value Pr(>|t|)
# INTERCEPT 2.197225 0.157135 13.983 0.000 ***
# A 0.389465 0.530684 0.734 0.463
# B -1.888120 0.159879 -11.810 0.000 ***
# AB -4.174668 0.555642 -7.513 0.000 ***
#
# Note: robust standard errors
#
# Number of observations: 12
# R-squared: 0.992

The family=binomial implies Logit (Logistic) Regression, which is itself produces a binary result.
From Quick-R
Logistic Regression
Logistic regression is useful when you are predicting a binary outcome
from a set of continuous predictor variables. It is frequently
preferred over discriminant function analysis because of its less
restrictive assumptions.

The data shows an interaction. Try to fit a different model, logistic is not appropriate.
with(my_data, interaction.plot(A, B, Y, fixed = TRUE, col = 2:3, type = "l"))
An analysis of variance shows clear significance for all factors and interaction.
fit <- aov(Y~(A*B),data=my_data)
summary(fit)
Df Sum Sq Mean Sq F value Pr(>F)
A 1 0.2002 0.2002 130.6 3.11e-06 ***
B 1 1.1224 1.1224 732.0 3.75e-09 ***
A:B 1 0.2494 0.2494 162.7 1.35e-06 ***
Residuals 8 0.0123 0.0015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Polynomial regression line through origin with equation in calibration curve

I would like a second order(? is it) regression line plotted through zero and crucially I need the equation for this relationship.
Here's my data:
ecoli_ug_ml A420 rpt
1 0 0.000 1
2 10 0.129 1
3 20 0.257 1
4 30 0.379 1
5 40 0.479 1
6 50 0.579 1
7 60 0.673 1
8 70 0.758 1
9 80 0.838 1
10 90 0.912 1
11 100 0.976 1
12 0 0.000 2
13 10 0.126 2
14 20 0.257 2
15 30 0.382 2
16 40 0.490 2
17 50 0.592 2
18 60 0.684 2
19 70 0.772 2
20 80 0.847 2
21 90 0.917 2
22 100 0.977 2
23 0 0.000 3
24 10 0.125 3
25 20 0.258 3
26 30 0.376 3
27 40 0.488 3
28 50 0.582 3
29 60 0.681 3
30 70 0.768 3
31 80 0.846 3
32 90 0.915 3
33 100 0.977 3
My plot looks like this: (sci2 is just some axis and text formatting, can include if necessary)
ggplot(calib, aes(ecoli_ug_ml, A420)) +
geom_point(shape=calib$rpt) +
stat_smooth(method="lm", formula=y~poly(x - 1,2)) +
scale_x_continuous(expression(paste(italic("E. coli"),~"concentration, " ,mu,g~mL^-1,))) +
scale_y_continuous(expression(paste(Absorbance["420nm"], ~ ", a.u."))) +
sci2
When I view this, the fit of this line to the points is spectacularly good.
When I check out coef, I think there is non-zero y-intercept (which is unacceptable for my purposes) but to be honest I don't really understand these lines:
coef(lm(A420 ~ poly(ecoli_ug_ml, 2, raw=TRUE), data = calib))
(Intercept) poly(ecoli_ug_ml, 2, raw = TRUE)1
-1.979021e-03 1.374789e-02
poly(ecoli_ug_ml, 2, raw = TRUE)2
-3.964258e-05
Therefore, I assume the plot is actually not quite right either.
So, what I need is to generate a regression line forced through zero and get the equation for it, and, understand what it's saying when it gives me said equation. If I could annotate the plot area with the equation directly I would be incredibly stoked.
I have spent approximately 8 hours trying to work this out now, I checked excel and got a formula in 8 seconds but I would really like to get into using R for this. Thanks!
To clarify: the primary purpose of this plot is not to demonstrate the distribution of these data but rather to provide a visual confirmation that the equation I generate from these points fits the readings well

summary(lm(A420~poly(ecoli_ug_ml,2,raw=T),data=calib))
# Call:
# lm(formula = A420 ~ poly(ecoli_ug_ml, 2, raw = T), data = calib)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -1.979e-03 1.926e-03 -1.028 0.312
# poly(ecoli_ug_ml, 2, raw = T)1 1.375e-02 8.961e-05 153.419 <2e-16 ***
# poly(ecoli_ug_ml, 2, raw = T)2 -3.964e-05 8.631e-07 -45.932 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 0.004379 on 30 degrees of freedom
# Multiple R-squared: 0.9998, Adjusted R-squared: 0.9998
# F-statistic: 8.343e+04 on 2 and 30 DF, p-value: < 2.2e-16
So the intercept is not exactly 0 but it is small compared to the Std. Error. In other words, the intercept is not significantly different from 0.
You can force a fit without the intercept this way (note the -1 in the formula):
summary(lm(A420~poly(ecoli_ug_ml,2,raw=T)-1,data=calib))
# Call:
# lm(formula = A420 ~ poly(ecoli_ug_ml, 2, raw = T) - 1, data = calib)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# poly(ecoli_ug_ml, 2, raw = T)1 1.367e-02 5.188e-05 263.54 <2e-16 ***
# poly(ecoli_ug_ml, 2, raw = T)2 -3.905e-05 6.396e-07 -61.05 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Residual standard error: 0.004383 on 31 degrees of freedom
# Multiple R-squared: 1, Adjusted R-squared: 1
# F-statistic: 3.4e+05 on 2 and 31 DF, p-value: < 2.2e-16
Note that the coefficients do not change appreciably.
EDIT (Response to OP's comment)
The formula specified in stat_smooth(...) is just passed directly to the lm(...) function, so you can specify in stat_smooth(...) any formula that works in lm(...). The point of the results above is that, even without forcing the intercept to 0, it is extremely small (-2e-3) compared to the range in y (0-1), so plotting curves with and without will give nearly indistinguishable results. You can see this for yourself by running this code:
ggplot(calib, aes(ecoli_ug_ml, A420)) +
geom_point(shape=calib$rpt) +
stat_smooth(method="lm", formula=y~poly(x,2,raw=T),colour="red") +
stat_smooth(method="lm", formula=y~-1+poly(x,2,raw=T),colour="blue") +
scale_x_continuous(expression(paste(italic("E. coli"),~"concentration, " ,mu,g~mL^-1,))) +
scale_y_continuous(expression(paste(Absorbance["420nm"], ~ ", a.u.")))
The blue and red curves are nearly, but not quite on top of each other (you may have to open up your plot window to see it). And no, you do not have to do this "outside of ggplot."
The problem you reported relates to using the default raw=F. This causes poly(...) to use orthogonal polynomials, which by definition have constant terms. So using y~-1+poly(x,2) doesn't really make sense, whereas using y~-1+poly(x,2,raw=T) does make sense.
Finally, if all this business of using poly(...) with or without raw=T is causing confusion, you can achieve the exact same result using formula = y~ -1 + x + I(x^2). This fits a second order polynomial (a*x +b*x^2) and suppresses the constant term.

I think you are misinterpreting that Intercept and also how stat_smooth works. Polynomial fits done by statisticians typically do not use the raw=TRUE parameter. The default is FALSE and the polynomials are constructed to be orthogonal to allow proper statistical assessment of the fit improvement when looking at the standard errors. It is instructive to look at what happens if you attempt to eliminate the Intercept by using -1 or 0+ in the formula. Try with your data and code to get rid of the intercept:
....+
stat_smooth(method="lm", formula=y~0+poly(x - 1,2)) + ...
You will see the fitted line intercepting the y axis at -0.5 and change. Now look at the non-raw value of the intercept.
coef(lm(A420~poly(ecoli_ug_ml,2),data=ecoli))
(Intercept) poly(ecoli_ug_ml, 2)1 poly(ecoli_ug_ml, 2)2
0.5466667 1.7772858 -0.2011251
So the intercept is shifting the whole curve upward to let the polynomial fit have the best fitting curvature. If you want to draw a line with ggplot2 that meets some different specification you should calculate it outside of ggplot2 and then plot it without the error bands because it really won't have the proper statistical properties.
Nonetheless, this is the way to apply what in this case is a trivial amount of adjustment and I am offering it only as an illustration of how to add an externally derived set of values. I think _ad_hoc_ adjustments like this are dangerous in practice:
mod <- lm(A420~poly(ecoli_ug_ml,2), data=ecoli)
ecoli$shifted_pred <- predict(mod) - predict( mod, newdata=list(ecoli_ug_ml=0))
ggplot(ecoli, aes(ecoli_ug_ml, A420)) +
geom_point(shape=ecoli$rpt) +
scale_x_continuous(expression(paste(italic("E. coli"),~"concentration, " ,mu,g~mL^-1,))) +
scale_y_continuous(expression(paste(Absorbance["420nm"], ~ ", a.u.")))+
geom_line(data=ecoli, aes(x= ecoli_ug_ml, y=shifted_pred ) )

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Robust regression p values with lmrob - r

Related

Plotting mean and standard error of mean from linear regression

How to use ANOVA (function = aov) for a fractional factorial designed experiment with 6 independent variables in R

Is there a way to replicate the functions from panelAR with plm or another package?

How do I use the glm() function?

Polynomial regression line through origin with equation in calibration curve

Categories

Resources