Precision in summary output of lm in R

I am writing some exercises with the r-exams package in which I print a summary from an lm object and ask students things like, "What is the estimated value of the intercept?" The idea is that the student copies the value from the summary output and uses it as the answer. The issue is that I use the values from the coef() function as the correct answers, but this is not a good idea since the precision of these values is quite different from the precision of the values shown in the summary output. Here is an example:
set.seed(123)
library(tidyverse)

## DATA GENERATION
xbreaks <- c(runif(1, 4, 4.8), runif(1, 6, 6.9), runif(1, 7.8, 8.5), runif(1, 9, 10))
ybreaks <- c(runif(1, 500, 1000), runif(1, 1800, 4000), runif(1, 200, 800))
b11 <- (ybreaks[2] - ybreaks[1]) / (xbreaks[2] - xbreaks[1])
b10 <- ybreaks[1] - b11 * xbreaks[1]
b31 <- (ybreaks[3] - ybreaks[2]) / (xbreaks[4] - xbreaks[3])
b30 <- ybreaks[2] - b31 * xbreaks[3]
points_df <- data.frame(x = xbreaks, y = ybreaks[c(1, 2, 2, 3)])
n <- rpois(3, 120)
x1 <- runif(n[1], xbreaks[1], xbreaks[2])
x2 <- runif(n[2], xbreaks[2], xbreaks[3])
x3 <- runif(n[3], xbreaks[3], xbreaks[4])
y <- c(b10 + b11 * x1 + rnorm(n[1], 0, 200),
       ybreaks[2] + rnorm(n[2], 0, 200),
       b30 + b31 * x3 + rnorm(n[3], 0, 200))
z0_aw <- data.frame(ph = c(x1, x2, x3), UFC = y,
                    case = factor(c(rep(1, n[1]), rep(2, n[2]), rep(3, n[3]))))
mean_x <- z0_aw$ph %>% mean %>% round(2)
caserng <- sample(1:4, 1)
modrng <- sample(1:2, 1)
if (caserng != 4) {
  z0_aw <- z0_aw[z0_aw$case == caserng, ]
}
if (modrng == 1) {
  m0 <- lm(UFC ~ ph, data = z0_aw)
} else {
  ## build the lm() call by hand and substitute the numeric value of mean_x
  ## into the formula, so that summary() displays the centered term as I(ph - 7.2)
  cl <- call("lm", formula = UFC ~ I(ph - mean_x), data = as.name("z0_aw"))
  cl$formula[[3]][[2]][[3]] <- mean_x
  m0 <- eval(cl)
}
summary(m0)
#>
#> Call:
#> lm(formula = UFC ~ I(ph - 7.2), data = z0_aw)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -555.53 -121.98 5.46 115.38 457.08
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2726.86 57.33 47.57 <2e-16 ***
#> I(ph - 7.2) -840.05 31.46 -26.70 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 182.7 on 116 degrees of freedom
#> Multiple R-squared: 0.8601, Adjusted R-squared: 0.8589
#> F-statistic: 713.1 on 1 and 116 DF, p-value: < 2.2e-16
coef(m0)
#> (Intercept) I(ph - 7.2)
#> 2726.8605 -840.0515
Created on 2021-05-14 by the reprex package (v2.0.0)
Suppose that extol: 0.0001 is set in r-exams and the student is asked for the estimated value of the intercept. The student will get a wrong answer, since they will answer 2726.86 while the correct answer from coef() is 2726.8605.
As can be seen, the output of summary() uses 2 decimals here, whereas the coef() values have considerably more precision. I want to know how many decimals summary() is using, in order to apply the same format to the values produced by coef(). This will ensure that the answer the student reads off the summary output matches the stored solution.
I just want to do this:
answers <- coef(m0) %>% format(digits = dsum) %>% as.numeric()
where dsum is the number of digits also used by the summary output.
Note: retaining a precision of 4 decimals is needed, since I also ask students about the R-squared value shown in the same summary output, so it is not a good idea to set, say, extol: 0.01. Also, the problems are generated at random and the magnitude of the estimated coefficients changes, which, as I have noticed, is directly related to the precision used in the summary output.

Some useful information for such questions in R/exams:
The extol can also be a vector, so that you can set different tolerances for the coefficients, the R-squared, etc.
When asking about the R-squared, though, I typically ask for it "in percent". Then the same tolerance may be suitable as for the coefficients.
I would recommend controlling the size of the coefficients suitably so that digits and extol can be set accordingly.
Personally, I typically store the exsolution at a higher precision than I request from the students. For example, exsolution could be 12.345678 while I only set extol to 0.01. This makes sure that when the correct answer is rounded to two decimal places, it is inside the correct interval determined by exsolution and extol.
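To illustrate, a cloze exercise built around the example above might store the intercept and the R-squared (in percent) like this in its meta-information (a sketch only; the | separator is the usual R/exams syntax for multiple cloze answers, and the numbers are taken from the summary shown above):
extype: cloze
exsolution: 2726.8605|86.01
extol: 0.01|0.1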
Details on formatting of the coefficients in the summary:
It is not obvious where exactly the formatting happens: The summary() method for lm objects returns an object of class summary.lm which has its own print() method which in turn calls printCoefmat(). The latter is the function that does the actual formatting.
The digits argument of these functions controls the number of significant digits, not the number of decimal places. This is particularly important when the coefficients become relatively large (say, in the thousands or more).
The coefficients are not formatted individually but jointly with the corresponding standard errors. The details depend on the digits, the size of both coefficients and standard errors, and whether any coefficients are aliased or exactly zero etc.
Without aliased/zero coefficients the formatting from summary(m0) can be replicated using format_coef(m0) as defined below. That's essentially the boiled-down code from printCoefmat().
format_coef <- function(object, digits = max(3L, getOption("digits") - 2L)) {
  ## coefficients and standard errors are formatted jointly
  coef_se <- summary(object)$coefficients[, 1L:2L]
  ## number of digits before the decimal point, across both columns
  digmin <- 1L + floor(log10(range(abs(coef_se))))
  format(round(coef_se, max(1L, digits - digmin)), digits = digits)[, 1L]
}
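With this helper the stored answers can then be computed directly from the formatted values, e.g. (a minimal usage sketch):
answers <- as.numeric(format_coef(m0))  # numeric values exactly as displayed by summary(m0)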

Related

Markov Switching Regression: Standard errors of the msmFit and receiving Latex Output

These are my first two questions on Stack Overflow - hopefully I am asking them the right way:
First Question: Standard Error
I am not entirely sure how the standard errors are specified using the "MSwM" package in R. As my data suffers from autocorrelation and heteroskedasticity, I am not sure how I can counteract these issues. I tried to do the following, but unfortunately, it did not work out the way I wanted (I want to use either Newey West standard errors or HAC standard errors):
fit1 <- lm(wage ~ educ + familystatus + region)
fit2 <- coeftest(fit1, vcov = NeweyWest(fit1, verbose = TRUE))
mod.mswm <- msmFit(fit2, k = 2, sw = rep(TRUE, 5), control = list(parallel = FALSE))
summary(mod.mswm)
Unfortunately, I receive the following error:
unable to find an inherited method for function ‘msmFit’ for signature
‘"coeftest", "numeric", "logical", "missing", "missing", "missing"’
Does anyone know how I could specify my desired standard errors using the "MSwM" package or is it not necessary at all?
Second Question: LaTeX Output
I have multiple Markov switching regressions in R (20 regressions in total). Thus, I am looking for a neat way to produce LaTeX tables, for example with the "Stargazer" or "TexReg" package. I am afraid that this might not be possible, since the "MSwM" package might not be supported by "Stargazer" or "TexReg".
Here is some sample code I want to get LaTeX output from:
ms <- msmFit(ols, k = 2, sw = rep(TRUE, 4))
summary(ms) # Obtaining the results for the first Markov-Regime Regression
The coefficients are reported as follows (using some example code):
Coefficients:
Regime 1
---------
Estimate Std. Error t value Pr(>|t|)
(Intercept)(S) 0.8417 0.3025 2.7825 0.005394 **
x(S) -0.0533 0.1340 -0.3978 0.690778
y_1(S) 0.9208 0.0306 30.0915 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5034675
Multiple R-squared: 0.8375
Standardized Residuals:
Min Q1 Med Q3 Max
-1.5153666657 -0.0906543311 0.0001873641 0.1656717256 1.2020898986
Regime 2
---------
Estimate Std. Error t value Pr(>|t|)
(Intercept)(S) 8.6393 0.7244 11.9261 < 2.2e-16 ***
x(S) 1.8771 0.3107 6.0415 1.527e-09 ***
y_1(S) -0.0569 0.0797 -0.7139 0.4753
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9339683
Multiple R-squared: 0.2408
Standardized Residuals:
Min Q1 Med Q3 Max
-2.31102193 -0.03317756 0.01034139 0.04509105 2.85245598
Transition probabilities:
Regime 1 Regime 2
Regime 1 0.98499728 0.02290884
Regime 2 0.01500272 0.97709116
Is there any way to get LaTeX output using Stargazer or any other available package? If so, how do I have to specify the parameters of the corresponding package?
Any help is highly appreciated. Thank you very much in advance.
I am dealing with the same issue. I am not sure whether this solves the problem, but note that the fitted model is an object of class MSM.lm. Given such an object, you can extract certain items in data frame or matrix format. For instance, you can output the estimated coefficients along with the standard errors in LaTeX format as follows:
library(xtable)
xtable(cbind(mod.mswm@Coef, mod.mswm@seCoef))
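If texreg output is preferred, one possible route (an untested sketch; it assumes, per the MSwM documentation, that the Coef and seCoef slots hold one row per regime) is to build a texreg object by hand with createTexreg():
library(texreg)
est <- unlist(mod.mswm@Coef[1, ])    # regime-1 estimates
se  <- unlist(mod.mswm@seCoef[1, ])  # regime-1 standard errors
pv  <- 2 * pnorm(abs(est / se), lower.tail = FALSE)  # normal-approximation p values
tr  <- createTexreg(coef.names = colnames(mod.mswm@Coef),
                    coef = est, se = se, pvalues = pv)
texreg(tr)  # prints the LaTeX table for regime 1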

Creating syntactically valid names from a factor in R while retaining levels

I am making a bioinformatics shiny app that reads user-supplied group names from an Excel file. As these names can be syntactically invalid, I would like to represent them internally as valid names.
As an example, I can have this input:
(grps <- as.factor(c("T=0","T=0","T=4-","T=4+","T=4+")))
[1] T=0 T=0 T=4- T=4+ T=4+
Levels: T=0 T=4- T=4+
Ideally, I would like R to make valid names, but keep the groups/levels the same, for instance the following would be fine:
"T.0" "T.0" "T.4minus" "T.4plus" "T.4plus"
When using make.names(), however, all invalid characters are converted to the same character:
(grps2 <- as.factor(make.names(grps)))
[1] T.0 T.0 T.4. T.4. T.4.
Levels: T.0 T.4.
So both T=4- and T=4+ are given the same name and a level is lost (which causes problems in subsequent analyses). Also, setting unique=TRUE does not solve the problem, because
(grps3 <- as.factor(make.names(grps,unique=TRUE)))
[1] T.0 T.0.1 T.4. T.4..1 T.4..2
Levels: T.0 T.0.1 T.4. T.4..1 T.4..2
and group T=4+ is split into 2 different groups and levels are gained.
Does anybody know how it is possible in general to make a factor into valid names, while keeping the same levels?
Please keep in mind that user input can widely vary, so manually replacing "-" with "minus" does not work here.
Thanks in advance for your help!
With the mapvalues function from plyr you can do:
require("plyr")
mapvalues(grps, levels(grps), make.names(levels(grps), unique=TRUE))
Since this works on the levels rather than on each value of the factor, identical values keep identical names and the number of levels stays the same.
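The same effect can be had in base R by assigning to the levels directly, since levels<- also operates on the levels rather than on the individual values (a small sketch, reusing grps from the question):
grps2 <- grps
levels(grps2) <- make.names(levels(grps), unique = TRUE)
grps2
## [1] T.0    T.0    T.4.   T.4..1 T.4..1
## Levels: T.0 T.4. T.4..1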
The labels associated with the levels of a factor are not required to meet the same restrictions as object names. Consider the following example, where I rename the gear column of the mtcars data set, make it a factor, and give it the same levels as in your example.
library(magrittr)
library(dplyr)
library(broom)
D <- mtcars[c("mpg", "gear")] %>%
  setNames(c("y", "grps")) %>%
  mutate(grps = factor(grps, 3:5, c("T=0", "T=4-", "T=4+")))
Notice that I am able to fit a linear model, get a summary, force it to a data frame, all while the level names have the =, -, and + symbols in them.
fit <- lm(y ~ grps, data = D)
fit
Call:
lm(formula = y ~ grps, data = D)
Coefficients:
(Intercept) grpsT=4- grpsT=4+
16.107 8.427 5.273
summary(fit)
Call:
lm(formula = y ~ grps, data = D)
Residuals:
Min 1Q Median 3Q Max
-6.7333 -3.2333 -0.9067 2.8483 9.3667
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 16.107 1.216 13.250 7.87e-14 ***
grpsT=4- 8.427 1.823 4.621 7.26e-05 ***
grpsT=4+ 5.273 2.431 2.169 0.0384 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.708 on 29 degrees of freedom
Multiple R-squared: 0.4292, Adjusted R-squared: 0.3898
F-statistic: 10.9 on 2 and 29 DF, p-value: 0.0002948
tidy(fit)
term estimate std.error statistic p.value
1 (Intercept) 16.106667 1.215611 13.249852 7.867272e-14
2 grpsT=4- 8.426667 1.823417 4.621361 7.257382e-05
3 grpsT=4+ 5.273333 2.431222 2.169005 3.842222e-02
So I'm left thinking that either
You're making things harder on yourself than you need to, or
It isn't clear why you need to make the levels syntactically valid object names.

achieved convergence tolerance (and other outputs) from nls

So, I am using nls() to do nonlinear regression in R.
I now have some code which does it for me and I get the correct output (phew!).
I can easily store the coefficients in a data frame using coef(), but I also need to store some of the other data from the summary too.
Here's what I get when I run summary(Power.model)
Formula: Power.mean ~ a + (b * (Power.rep^-c))
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 1240.197 4.075 304.358 <2e-16 ***
b 10.400 14.550 0.715 0.490
c 6.829 230.336 0.030 0.977
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 13.97 on 11 degrees of freedom
Number of iterations to convergence: 17
Achieved convergence tolerance: 4.011e-06
I can get the Estimates and calculate the Residual sum of squares, but I would really like to also store std.error, t value, residual std error, number of iterations and (most important of all) the achieved convergence tolerance in the table too.
I understand that I can use capture.output(summary(Power.model)) to capture these, but I just end up with a bunch of strings. What I really want is to capture only the numbers (ideally as numbers) without (a) all of the extras (e.g., the string "Achieved convergence tolerance: ") and (b) without having to convert the strings into regular (single/double) numbers (e.g., 4.011e-06 into 0.000004011).
I can't seem to find a list of all of the functions I can run on my nls output. The only ones I have found so far are coef() and resid(). A list would be ideal, but otherwise any other advice on accessing the data in the summary without resorting to capture.output(), and the string editing/conversion that would inevitably follow, would be very much appreciated.
coef(summary(Power.model)) will give a matrix containing some of these items, and Power.model$convInfo will give a list whose components contain the others. The residual sum of squares can be obtained using deviance(Power.model).
methods(class = "nls") will give a list of functions that act on "nls" objects, and str(Power.model) and str(summary(Power.model)) will show the internal components of "nls" and "summary.nls" objects.
For example, using the builtin BOD data frame:
> fm <- nls(demand ~ a + b * Time, BOD, start = list(a = 1, b = 1))
> coef(summary(fm))
Estimate Std. Error t value Pr(>|t|)
a 8.521429 2.6589490 3.204811 0.03275033
b 1.721429 0.6386589 2.695380 0.05435392
> fm$convInfo
$isConv
[1] TRUE
$finIter
[1] 1
$finTol
[1] 3.966571e-09
$stopCode
[1] 0
$stopMessage
[1] "converged"
> deviance(fm)
[1] 38.06929
> sum(resid(fm)^2) # same
[1] 38.06929
You might also be interested in the broom package which will provide data frame representations of nls output like this:
> library(broom)
> tidy(fm)
term estimate std.error statistic p.value
1 a 8.521429 2.6589490 3.204811 0.03275033
2 b 1.721429 0.6386589 2.695380 0.05435392
> glance(fm)
sigma isConv finTol logLik AIC BIC deviance df.residual
1 3.085016 TRUE 3.966571e-09 -14.05658 34.11315 33.48843 38.06929 4
Use names(Power.model); it returns the names of the object's components, and you can also use names(Power.model$...), with ... one of the names of Power.model.
For example, Power.model$convInfo$finTol returns the Achieved convergence Tolerance.
If you are using RStudio, you can click on the arrow near Power.model in the Environment window and it will display all the names of Power.model with their values, which allows you to choose the correct name.

Passing strings as variable names in an R for loop, but keeping names in results

Ok, I'm working on a silly toy problem in R (part of an edx course actually), running a bunch of bivariate logits and looking at the p values. And I'm trying to add some coding practice to my data crunching practice by doing the chore as a for loop rather than as a bunch of individual models. So I pulled the variable names I wanted out of the data frame, stuck them in a vector, and passed that vector to glm() with a for loop.
After about an hour and a half of searching and hacking around to deal with the inevitable variable-length errors, I realized that R was interpreting the elements of the variable vector as character strings rather than variable names. I solved that problem and ended up with a final working loop as follows:
for (i in 1:length(dumber)) {
  print(summary(glm(WorldSeries ~ get(dumber[i]), data = baseball, family = binomial)))
}
where dumber is the vector of independent variable names, WorldSeries is the dependent variable.
And that was awesome... except for one little problem. The console output is a bunch of model summaries, which is what I want, but the summaries aren't labelled with the variable names. Instead, they're just labelled with the code from the for loop! For example, here are the summaries for two of the variables my little loop went through:
Call:
glm(formula = WorldSeries ~ get(dumber[i]), family = binomial,
data = baseball)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.5610 -0.5209 -0.5088 -0.4902 2.1268
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.08725 6.07285 -0.014 0.989
get(dumber[i]) -4.65992 15.06881 -0.309 0.757
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 84.926 on 113 degrees of freedom
Residual deviance: 84.830 on 112 degrees of freedom
(130 observations deleted due to missingness)
AIC: 88.83
Number of Fisher Scoring iterations: 4
Call:
glm(formula = WorldSeries ~ get(dumber[i]), family = binomial,
data = baseball)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9871 -0.8017 -0.5089 -0.5089 2.2643
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.03868 0.43750 0.088 0.929559
get(dumber[i]) -0.25220 0.07422 -3.398 0.000678 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 239.12 on 243 degrees of freedom
Residual deviance: 226.96 on 242 degrees of freedom
AIC: 230.96
Number of Fisher Scoring iterations: 4
That's obviously hopeless, especially as the number of elements of the variable vector increases. I'm sure if I knew a lot more about object-oriented programming than I do, I'd be able to just create some kind of complicated object that has the elements of dumber matched to the model summaries, or directly tinker with the summaries to insert the elements of dumber into where it currently just reads "get(dumber[i])". But I currently know jack-all about OOP (I'm learning! It's slow!). So does anyone wanna clue me in? Thanks!
You could do this (only send the outcome and predictor columns one at a time to glm):
for (i in 1:length(dumber)) {
  print(summary(glm(WorldSeries ~ ., data = baseball[, c("WorldSeries", dumber[i])],
                    family = binomial)))
}
You could also do this (label each output with the corresponding value of 'dumber'):
for (i in 1:length(dumber)) {
  print(paste0("Current predictor is ... ", dumber[i]))
  print(summary(glm(WorldSeries ~ get(dumber[i]), data = baseball, family = binomial)))
}
As you progress down the road to R mastery, you would probably want to build a list of summary objects and then use lapply to print or cat your tailored output.
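A sketch of that list-based approach (hypothetical code, reusing the same dumber vector): reformulate() builds each formula from the string, so every model's Call also shows the real variable name instead of get(dumber[i]).
models <- lapply(dumber, function(v)
  glm(reformulate(v, response = "WorldSeries"), data = baseball, family = binomial))
names(models) <- dumber
for (v in dumber) {
  cat("\n==== Predictor:", v, "====\n")
  print(summary(models[[v]]))
}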

Regression analysis or ANOVA?

I hope to be as clear as I can.
Let's say I have a dataset with 10 variables, where 4 of them represent a phenomenon that I call Y.
The other 6 represent another phenomenon that I call X.
Each of those 10 variables contains 37 units. Those units are just the respondents of my analysis (a survey).
Since all the questions are based on a Likert scale, they are qualitative variables. The scale is from 0 to 7 for all of them, but there are "-1" and "-2" values where the answer is missing. Hence the scale actually goes from -2 to 7.
What I want to do is to calculate the regression between my Y (which contains 4 variables with 37 answers each) and my X (which contains 6 variables and the same number of respondents). I know that for qualitative analyses I should use ANOVA instead of regression, although I have read somewhere that regression is possible as well.
Until now I have proceeded this way:
> apply(Y, 1, function(Y) mean(Y[Y > 0]))  # average per row (respondent), ignoring the negative values
> Y.reg <- c(apply(Y, 1, function(Y) mean(Y[Y > 0])))  # create the vector Y: one variable with 37 numbers
> apply(X, 1, function(X) mean(X[X > 0]))
> X.reg <- c(apply(X, 1, function(X) mean(X[X > 0])))  # create the vector X: one variable with 37 numbers
> reg1 <- lm(Y.reg ~ X.reg)  # run the first regression
> summary(reg1)  # see the results
Call:
lm(formula = Y.reg ~ X.reg)
Residuals:
Min 1Q Median 3Q Max
-2.26183 -0.49434 -0.02658 0.37260 2.08899
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.2577 0.4986 8.539 4.46e-10 ***
X.reg 0.1008 0.1282 0.786 0.437
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.7827 on 35 degrees of freedom
Multiple R-squared: 0.01736, Adjusted R-squared: -0.01072
F-statistic: 0.6182 on 1 and 35 DF, p-value: 0.437
But as you can see, even though I collapse Y (4 variables) and X (6 variables) into single averages and drop the negative values, I get a very low R^2.
If I try anova() instead, I have this problem:
> Ymatrix <- as.matrix(Y)
> Xmatrix <- as.matrix(X)  # Y and X in their original form: composed of 4 and 6 variables, negative values included
Error in UseMethod("anova") :
  no applicable method for 'anova' applied to an object of class "c('matrix', 'integer', 'numeric')"
To be honest, a few days ago I succeeded in using anova, but unfortunately I do not remember how and I did not save the commands anywhere.
What I would like to know is:
First of all, am I wrong in how I approach my problem?
What do you think about the regression output?
Finally, how can I do to make the anova? If I have to do it.
If your response (Y) and your predictor (X) are on a numeric scale, you can use regression.
If your response (Y) is on a numeric scale and your predictor (X) is on a categorical scale, you can use ANOVA (see the illustration below).
Suggested:
Run validity and reliability tests to check whether the answers (indicators) are valid and reliable for the response and the predictor before you use the regression method.
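To make the connection concrete, here is a small illustration on a built-in data set (npk, which has a factor predictor): fitting the model as a regression and running ANOVA on that fit gives the same table as calling aov() directly.
fit <- lm(yield ~ block, data = npk)      # regression with a factor predictor
anova(fit)                                # ANOVA table derived from the regression fit
summary(aov(yield ~ block, data = npk))   # equivalent one-way ANOVA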
I disagree with Denny's answer. You can use either approach regardless of the type of data that you have. If you have categorical data, you can express it as numeric using dummy encoding. For example, given a feature x with 3 options, say 1, 2, and 3, you can encode it as numeric by creating 3 additional variables x1, x2, and x3. If x is 1, x1 will be 1, x2 will be 0, and x3 will be 0. If x is missing, the three new x values will all be zero (a small sketch follows below).
In your case I would recommend that you try regression first, because of the number of features that you have and because it tends to be straightforward. ANOVA can become complicated as the number of features increases. Both should work, assuming your data meets the assumptions required by both techniques.
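A compact way to produce such an encoding in R (a sketch; model.matrix() with an intercept-free formula creates one 0/1 indicator column per factor level):
x <- factor(c(1, 2, 3, 2))
model.matrix(~ x - 1)  # columns x1, x2, x3 of 0/1 indicators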
