Regression analysis or ANOVA? - R

I hope to be as clear as I can.
Let's say I have a dataset with 10 variables, where 4 of them represent a phenomenon that I call Y,
and the other 6 represent another phenomenon that I call X.
Each of the 10 variables contains 37 units. The units are simply the respondents of my analysis (a survey).
Since all the questions are based on a Likert scale, they are qualitative variables. The scale runs from 0 to 7 for all of them, but there are "-1" and "-2" values where the answer is missing, so the recorded values actually range from -2 to 7.
What I want to do is run the regression between my Y (4 variables, 37 answers each) and my X (6 variables, same number of respondents). I know that for qualitative variables I should use ANOVA instead of regression, although I have read somewhere that regression is possible as well.
So far I have tried the following:
> apply(Y, 1, function(y) mean(y[y > 0]))  # average per row (respondent), keeping only positive answers
> Y.reg <- apply(Y, 1, function(y) mean(y[y > 0]))  # collapse Y into a single vector of 37 numbers
> apply(X, 1, function(x) mean(x[x > 0]))
> X.reg <- apply(X, 1, function(x) mean(x[x > 0]))  # collapse X into a single vector of 37 numbers
> reg1 <- lm(Y.reg ~ X.reg)  # fit the first regression
> summary(reg1)  # inspect the results
Call:
lm(formula = Y.reg ~ X.reg)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.26183 -0.49434 -0.02658  0.37260  2.08899 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.2577     0.4986   8.539 4.46e-10 ***
X.reg         0.1008     0.1282   0.786    0.437    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7827 on 35 degrees of freedom
Multiple R-squared:  0.01736,  Adjusted R-squared:  -0.01072 
F-statistic: 0.6182 on 1 and 35 DF,  p-value: 0.437
But as you can see, even though I no longer treat Y as 4 variables and X as 6, and I exclude the negative values, I get a very low R^2.
If I try anova instead, I run into this problem:
> Ymatrix <- as.matrix(Y)
> Xmatrix <- as.matrix(X)  # where both Y and X are in their original form, i.e. composed of several variables (4 and 6) and still containing the negative values
Error in UseMethod("anova") :
  no applicable method for 'anova' applied to an object of class "c('matrix', 'integer', 'numeric')"
To be honest, a few days ago I managed to run the anova, but unfortunately I do not remember how, and I did not save the commands anywhere.
What I would like to know is:
First of all, am I wrong in how I am approaching my problem?
What do you think about the regression output?
Finally, how can I run the ANOVA, if that is what I should do?

If your response (Y) and predictor (X) are both on a numeric scale, you can use regression.
If your response (Y) is on a numeric scale and your predictor (X) is on a categorical scale, you can use ANOVA.
Suggestion:
Before you use a regression method, you should run validity and reliability tests to check that the answers (indicators) are valid and reliable measures of the response and predictor.
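For instance, a common reliability check is Cronbach's alpha. A minimal sketch, assuming the psych package and that the four Y items are columns of a data frame Y with the negative missing-value codes already recoded to NA:
library(psych)
# Cronbach's alpha for the Y items; values around 0.7 or above are
# conventionally read as acceptable reliability
alpha(Y)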

I disagree with Denny's answer. You can use either approach regardless of the type of data that you have. If you have categorical data, you can express it as numeric using dummy encoding. For example, given a feature x with 3 options, say 1, 2, and 3, you can encode it as numeric by creating 3 new indicator variables x1, x2, and x3. If x is 1, then x1 will be 1, x2 will be 0, and x3 will be 0. If x is missing, all three new variables will be 0.
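As a quick illustration of that encoding (a sketch with made-up data; in practice model.matrix() or lm()'s factor handling can build such indicators for you):
x <- c(1, 2, 3, 2, NA)                   # a categorical feature with 3 options
x1 <- ifelse(!is.na(x) & x == 1, 1, 0)   # indicator for option 1
x2 <- ifelse(!is.na(x) & x == 2, 1, 0)   # indicator for option 2
x3 <- ifelse(!is.na(x) & x == 3, 1, 0)   # indicator for option 3
cbind(x1, x2, x3)                        # the row for missing x is all zeros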
In your case I would recommend that you try regression first, because of the number of features that you have and because it tends to be straightforward. ANOVA can become complicated as the number of features increases. Both should work, assuming your data meets the assumptions required by each technique.
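To see the connection on the same data: anova() applied to an lm fit returns the corresponding ANOVA table (a sketch, reusing the averaged Y.reg and X.reg vectors from the question):
fit <- lm(Y.reg ~ X.reg)  # the regression view: coefficients, R-squared
summary(fit)
anova(fit)                # the ANOVA view of the same fitted model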

Precision in summary output of lm in R

I am writing some exercises using the r-exams package, in which I print a summary from an lm object and ask students things like "what is the estimated value of the intercept?". The idea is that the student copies the value from the summary output and uses it as the answer. The issue here is that I use the values from the coef() function as the correct answers, but this is not a good idea, since the precision of these values is quite different from the precision of the values shown in the summary output. Here is an example:
set.seed(123)
library(tidyverse)

## DATA GENERATION
xbreaks <- c(runif(1, 4, 4.8), runif(1, 6, 6.9), runif(1, 7.8, 8.5), runif(1, 9, 10))
ybreaks <- c(runif(1, 500, 1000), runif(1, 1800, 4000), runif(1, 200, 800))
b11 <- (ybreaks[2] - ybreaks[1]) / (xbreaks[2] - xbreaks[1])
b10 <- ybreaks[1] - b11 * xbreaks[1]
b31 <- (ybreaks[3] - ybreaks[2]) / (xbreaks[4] - xbreaks[3])
b30 <- ybreaks[2] - b31 * xbreaks[3]
points_df <- data.frame(x = xbreaks, y = ybreaks[c(1, 2, 2, 3)])
n <- rpois(3, 120)
x1 <- runif(n[1], xbreaks[1], xbreaks[2])
x2 <- runif(n[2], xbreaks[2], xbreaks[3])
x3 <- runif(n[3], xbreaks[3], xbreaks[4])
y <- c(b10 + b11 * x1 + rnorm(n[1], 0, 200),
       ybreaks[2] + rnorm(n[2], 0, 200),
       b30 + b31 * x3 + rnorm(n[3], 0, 200))
z0_aw <- data.frame(ph = c(x1, x2, x3), UFC = y,
                    case = factor(c(rep(1, n[1]), rep(2, n[2]), rep(3, n[3]))))
mean_x <- z0_aw$ph %>% mean %>% round(2)
caserng <- sample(1:4, 1)
modrng <- sample(1:2, 1)
if (caserng != 4) {
  z0_aw <- z0_aw[z0_aw$case == caserng, ]
}
if (modrng == 1) {
  m0 <- lm(UFC ~ ph, data = z0_aw)
} else {
  cl <- call("lm", formula = UFC ~ I(ph - mean_x), data = as.name("z0_aw"))
  cl$formula[[3]][[2]][[3]] <- mean_x
  m0 <- eval(cl)
}
summary(m0)
#> 
#> Call:
#> lm(formula = UFC ~ I(ph - 7.2), data = z0_aw)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -555.53 -121.98    5.46  115.38  457.08 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  2726.86      57.33   47.57   <2e-16 ***
#> I(ph - 7.2)  -840.05      31.46  -26.70   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 182.7 on 116 degrees of freedom
#> Multiple R-squared:  0.8601, Adjusted R-squared:  0.8589 
#> F-statistic: 713.1 on 1 and 116 DF,  p-value: < 2.2e-16
coef(m0)
#> (Intercept) I(ph - 7.2) 
#>   2726.8605   -840.0515
Created on 2021-05-14 by the reprex package (v2.0.0)
Suppose that extol: 0.0001 is set in r-exams and the student is asked to give the estimated value of the intercept. The student's answer will be marked wrong, since they will answer 2726.86 while the correct answer according to coef() is 2726.8605.
As can be seen, the output of summary() uses 2 decimals, whereas the coef() values have considerably more precision. I want to know how many digits summary() is using, in order to apply the same format to the values produced by coef(). This will ensure that the answer the student reads off the summary output is counted as correct.
I just want to do this:
answers<-coef(m0) %>% format(digits=dsum) %>% as.numeric()
where dsum is the number of digits that the summary output uses.
Note: retaining a precision of 4 decimals is needed, since I also ask students about the R-squared value reported in the same summary output, so it is not a good idea to set, say, extol: 0.01. Also, the problems are generated at random and the magnitude of the estimated coefficients varies; I have noticed that this magnitude is directly related to the precision used in the summary output.
Some useful information for such questions in R/exams:
extol can also be a vector, so you can set different tolerances for the coefficients, the R-squared, etc.
When asking about the R-squared, though, I typically ask for it "in percent". Then the same tolerance may be suitable as for the coefficients.
I would recommend controlling the size of the coefficients so that digits and extol can be set accordingly.
Personally, I typically store the exsolution at a higher precision than I request from the students. For example, exsolution could be 12.345678 while I only set extol to 0.01. This makes sure that when the correct answer is rounded to two decimal places, it is still inside the interval determined by exsolution and extol.
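A quick numeric check of that last point (a sketch using the example numbers above):
exsolution <- 12.345678   # solution stored at high precision
extol <- 0.01             # tolerance requested from students
student_answer <- 12.35   # the correct answer rounded to two decimals
abs(student_answer - exsolution) <= extol  # TRUE: within the accepted interval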
Details on formatting of the coefficients in the summary:
It is not obvious where exactly the formatting happens: the summary() method for lm objects returns an object of class summary.lm, which has its own print() method, which in turn calls printCoefmat(). The latter is the function that does the actual formatting.
When setting the digits in these functions, this controls the number of significant digits, not the number of decimal places. This is particularly important when the coefficients become relatively large (say, in the thousands or more).
The coefficients are not formatted individually but jointly with the corresponding standard errors. The details depend on the digits, the size of both coefficients and standard errors, and whether any coefficients are aliased or exactly zero etc.
Without aliased/zero coefficients the formatting from summary(m0) can be replicated using format_coef(m0) as defined below. That's essentially the boiled-down code from printCoefmat().
format_coef <- function(object, digits = max(3L, getOption("digits") - 2L)) {
  ## estimates and standard errors, to be formatted jointly
  coef_se <- summary(object)$coefficients[, 1L:2L]
  ## orders of magnitude of the smallest and largest of these values
  digmin <- 1L + floor(log10(range(abs(coef_se))))
  ## round to the implied number of decimals, then format with 'digits' significant digits
  format(round(coef_se, max(1L, digits - digmin)), digits = digits)[, 1L]
}

How to compute level values for dynlm output with regression using differenced series

I have used the dynlm function to regress a differenced series, as the dependent variable, on its first 5 lags as the regressors. The summary output is shown below. Can someone help me compute the level fitted values from this output? I have also attached the data frame showing the residuals and the fitted values from the above regression.
Summary Output

Time series regression with "ts" data:
Start = 6, End = 364

Call:
dynlm(formula = dusagets ~ (L(dusagets, 1:5)))

Residuals:
    Min      1Q  Median      3Q     Max 
-6915.9  -748.9    20.7   822.1  6099.6 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -41.24158   88.26414  -0.467 0.640608    
L(dusagets, 1:5)1  -0.19753    0.05231  -3.776 0.000187 ***
L(dusagets, 1:5)2  -0.43436    0.05311  -8.179 5.22e-15 ***
L(dusagets, 1:5)3  -0.15207    0.05729  -2.654 0.008305 ** 
L(dusagets, 1:5)4  -0.14216    0.05292  -2.687 0.007561 ** 
L(dusagets, 1:5)5  -0.17909    0.05243  -3.415 0.000711 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1671 on 353 degrees of freedom
Multiple R-squared:  0.1858, Adjusted R-squared:  0.1743 
F-statistic: 16.11 on 5 and 353 DF,  p-value: 2.602e-14
P.S.: How do I attach a file? I wanted to attach the file with the residuals and the fitted values from the regression, but I don't know how!
Best regards
Deepak
By working with differences, all that is lost is the initial level of the series. Here's an example of how to get back to levels, where the situation is additionally complicated by the lagged terms.
library(dynlm)
y <- log10(UKDriverDeaths)
dy <- diff(y)
m <- dynlm(dy ~ L(dy, 1) + L(dy, 12))
Now fitted(m) holds fitted values for the differences, and the only thing missing is where to start from. In particular,
cumsum(fitted(m)) + y[1 + 12]
are values that can be compared with the original series in levels,
tail(y, -(1 + 12))
where we lose 1 observation to taking differences and another 12 to the maximal lag.
Now why does cumsum(fitted(m)) + y[1 + 12] give the desired result? In general, let the observed series in levels be $y_1, y_2, \dots$ and the differenced series be $\Delta y_2, \Delta y_3, \dots$; note that there is no $\Delta y_1$ because $y_0$ is not available.
Forgetting about the lags for a moment and just thinking about the role of cumsum, observe that
$$y_t = (y_t - y_{t-1}) + (y_{t-1} - y_{t-2}) + \dots + (y_2 - y_1) + y_1 = \Delta y_t + \Delta y_{t-1} + \dots + \Delta y_2 + y_1.$$
That is, summing all the changes from the beginning up to period $t$ first gives the aggregate change $y_t - y_1$; adding the starting level $y_1$ then recovers $y_t$.
cumsum accumulates those changes for each $t$ in a vectorized fashion, and we then add $y_{13}$ to the whole vector cumsum(fitted(m)) because, with the lags, every fitted value has the same starting point of interest, $y_{13}$.
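A quick visual check of the reconstruction (a sketch, assuming the model m fitted above):
lvl_fit <- cumsum(fitted(m)) + y[1 + 12]   # fitted values transformed back to levels
lvl_obs <- tail(y, -(1 + 12))              # observed levels, aligned with the fitted values
plot(as.numeric(lvl_obs), type = "l")      # observed series
lines(as.numeric(lvl_fit), col = "red")    # reconstructed fitted levels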

Creating syntactically valid names from a factor in R while retaining levels

I am making a bioinformatics Shiny app that reads user-supplied group names from an Excel file. As these names can be syntactically invalid, I would like to represent them internally as valid names.
As an example, I can have this input:
(grps <- as.factor(c("T=0","T=0","T=4-","T=4+","T=4+")))
[1] T=0 T=0 T=4- T=4+ T=4+
Levels: T=0 T=4- T=4+
Ideally, I would like R to make valid names, but keep the groups/levels the same, for instance the following would be fine:
"T.0" "T.0" "T.4minus" "T.4plus" "T.4plus"
When using make.names(), however, all invalid characters are converted to the same character:
(grps2 <- as.factor(make.names(grps)))
[1] T.0 T.0 T.4. T.4. T.4.
Levels: T.0 T.4.
So both T=4- and T=4+ are given the same name, and a level is lost (which causes problems in subsequent analyses). Setting unique=TRUE does not solve the problem either, because
(grps3 <- as.factor(make.names(grps,unique=TRUE)))
[1] T.0 T.0.1 T.4. T.4..1 T.4..2
Levels: T.0 T.0.1 T.4. T.4..1 T.4..2
and group T=4+ is split into two different groups, so levels are gained.
Does anybody know how, in general, to turn a factor into valid names while keeping the same levels?
Please keep in mind that user input can vary widely, so manually replacing "-" with "minus" does not work here.
Thanks in advance for your help!
With the mapvalues function from plyr you can do:
require("plyr")
mapvalues(grps, levels(grps), make.names(levels(grps), unique=TRUE))
Since this works directly on the (unique) levels instead of on the individual values, the number of levels stays the same.
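The same idea in base R, renaming the levels in place (a sketch; grps4 is just an illustrative name):
grps4 <- grps
levels(grps4) <- make.names(levels(grps4), unique = TRUE)
grps4
# [1] T.0    T.0    T.4.   T.4..1 T.4..1
# Levels: T.0 T.4. T.4..1
Because make.names() is applied to the already unique levels, unique = TRUE only has to disambiguate levels that collapse to the same valid name, so no level is lost and none is gained.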
The labels associated with the levels of a factor are not required to satisfy the same rules as object names. Consider the following example, where I rename the gear column of the mtcars data set, make it a factor, and give it the same levels as in your example.
library(magrittr)
library(dplyr)
library(broom)
D <- mtcars[c("mpg", "gear")] %>%
setNames(c("y", "grps")) %>%
mutate(grps = factor(grps, 3:5, c("T=0", "T=4-", "T=4+")))
Notice that I am able to fit a linear model, get a summary, force it to a data frame, all while the level names have the =, -, and + symbols in them.
fit <- lm(y ~ grps, data = D)
fit
Call:
lm(formula = y ~ grps, data = D)

Coefficients:
(Intercept)     grpsT=4-     grpsT=4+  
     16.107        8.427        5.273
summary(fit)
Call:
lm(formula = y ~ grps, data = D)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.7333 -3.2333 -0.9067  2.8483  9.3667 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   16.107      1.216  13.250 7.87e-14 ***
grpsT=4-       8.427      1.823   4.621 7.26e-05 ***
grpsT=4+       5.273      2.431   2.169   0.0384 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.708 on 29 degrees of freedom
Multiple R-squared:  0.4292, Adjusted R-squared:  0.3898 
F-statistic:  10.9 on 2 and 29 DF,  p-value: 0.0002948
tidy(fit)
         term  estimate std.error statistic      p.value
1 (Intercept) 16.106667  1.215611 13.249852 7.867272e-14
2    grpsT=4-  8.426667  1.823417  4.621361 7.257382e-05
3    grpsT=4+  5.273333  2.431222  2.169005 3.842222e-02
So I'm left thinking that either
You're making things harder on yourself than you need to, or
It isn't clear why you need to make the levels syntactically valid object names.

Multiple correlation coefficient in R

I am looking for a way to calculate the multiple correlation coefficient in R (http://en.wikipedia.org/wiki/Multiple_correlation). Is there a built-in function to calculate it?
I have one dependent variable and three independent ones.
I have not been able to find it online. Any ideas?
The easiest way to calculate what you seem to be asking for when you refer to 'the multiple correlation coefficient' (i.e. the correlation between two or more independent variables on the one hand, and one dependent variable on the other) is to fit a multiple linear regression, predicting the values of the variable treated as dependent from the values of the variables treated as independent, and then calculate the coefficient of correlation between the predicted and observed values of the dependent variable.
Here, for example, we create a linear model called mpg.model, with mpg as the dependent variable and wt and cyl as the independent variables, using the built-in mtcars dataset:
> mpg.model <- lm(mpg ~ wt + cyl, data = mtcars)
Having created the above model, we correlate the observed values of mpg (which are embedded in the object, within the model data frame) with the predicted values for the same variable (also embedded):
> cor(mpg.model$model$mpg, mpg.model$fitted.values)
[1] 0.9111681
R will in fact do this calculation for you, but without telling you so, when you ask it to create the summary of a model (as in Brian's answer): the summary of an lm object contains R-squared, which is the square of the coefficient of correlation between the predicted and observed values of the dependent variable. So an alternative way to get the same result is to extract R-squared from the summary.lm object and take the square root of it, thus:
> sqrt(summary(mpg.model)$r.squared)
[1] 0.9111681
I feel that I should point out, however, that the term 'multiple correlation coefficient' is ambiguous.
The built-in function lm gives at least one version; I'm not sure if this is what you are looking for:
fit <- lm(yield ~ N + P + K, data = npk)
summary(fit)
Gives:
Call:
lm(formula = yield ~ N + P + K, data = npk)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.2667 -3.6542  0.7083  3.4792  9.3333 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   54.650      2.205  24.784   <2e-16 ***
N1             5.617      2.205   2.547   0.0192 *  
P1            -1.183      2.205  -0.537   0.5974    
K1            -3.983      2.205  -1.806   0.0859 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.401 on 20 degrees of freedom
Multiple R-squared:  0.3342, Adjusted R-squared:  0.2343 
F-statistic: 3.346 on 3 and 20 DF,  p-value: 0.0397
More info on what's going on at ?summary.lm and ?lm.
Try this:
# load sample data
data(mtcars)
# calculate the correlation coefficients between all variables in `mtcars`
# using the built-in function
M <- cor(mtcars)
# M is a matrix of correlation coefficients, which you can display just by
# running
print(M)
# if you want to plot the correlation coefficients
library(corrplot)
corrplot(M, method = "number", type = "lower", insig = "blank", number.cex = 0.6)

Anova table comparing groups, in R, exported to latex?

I mostly work with observational data, but I read a lot of experimental hard-science papers that report results in the form of ANOVA tables, with letters indicating the significance of the differences between the groups, and then p-values of the F-statistic for the joint significance of what is essentially a factor-variable regression. Here is an example that I've pulled off of Google image search.
I think that this might be a useful way to present summary statistics about group-wise differences (or the lack thereof) in an observational dataset, before I go ahead and try to control for them in various ways. I'm not sure exactly what test the letters typically represent (Tukey something?), but pairwise t-tests would suit my purposes fine.
My main question: how can I get such an output from a factor-variable regression in R, and how can I seamlessly export it to LaTeX?
Here is some example data:
var = c(3500,600,12400,6400,1500,0,4400,400,900,2000,350,0,5800,0,12800,1200,350,800,2500,2000,0,3200,1100,0,0,0,0,0,1000,0,0,0,0,0,12400,6000,1700,3500,3000,1000,0,0,3500,5000,1000,3600,1600,3500,0,900,4200,0,0,0,0,1700,1250,500,950,500,600,1080,500,980,670,1200,600,550,4000,600,2800,650,0,3700,12500,0,0,0,1200,2700,0,NA,0,0,0,3700,2000,3500,0,0,0,3500,800,1400,0,500,7000,3500,0,0,0,0,2700,0,0,0,0,2000,5000,0,0,7000,0,4800,0,0,0,0,1800,0,2500,1600,4600,0,2000,5400,4500,3200,0,12200,0,3500,0,0,2800,3600,3000,0,3150,0,0,3750,2800,0,1000,1500,6000,3090,2800,600,0,0,1000,3800,3000,0,800,600,1200,0,240,1000,300,3600,0,1200,300,2700,NA,1300,1200,1400,4600,3200,750,300,750,1200,700,870,900,3200,1300,1500,1200,0,960,1800,8000,1200,NA,0,1080,1300,1080,900,700,5000,1500,3750,0,1400,900,1400,400,3900,0,1400,1600,960,1200,2600,420,3400,2500,500,4000,0,4250,570,600,4550,2000,0,0,4300,2000,0,0,0,0,NA,0,2060,2600,1600,1800,3000,900,0,0,3200,0,1500,3000,0,3700,6000,0,0,1250,1200,12800,0,1000,1100,0,950,2500,800,3000,3600,3600,1500,0,0,3600,800,0,1000,1600,1700,0,3500,3700,3000,350,700,3500,0,0,0,0,1500,0,400,0,0,0,0,0,0,0,500,0,0,0,0,5600,0,0,0)
factor = as.factor(c(5,2,5,5,5,3,4,5,5,5,3,1,1,1,5,3,6,6,6,5,5,5,3,5,3,3,3,3,4,3,3,3,4,3,5,5,3,5,3,3,3,3,5,3,3,3,3,3,5,5,5,5,5,3,3,5,3,5,5,3,5,5,4,3,5,5,5,5,5,5,4,5,3,5,4,4,3,4,3,5,3,3,5,5,5,3,5,5,4,3,3,5,5,4,3,3,5,3,3,4,3,3,3,3,5,5,3,5,5,3,3,5,4,3,3,3,4,4,5,3,1,5,5,1,5,5,5,3,3,4,5,5,5,3,3,4,5,4,5,3,5,5,5,3,3,3,3,3,3,3,3,3,3,3,4,3,3,3,3,3,3,3,4,5,4,6,4,3,5,5,3,5,3,3,4,3,5,5,5,3,5,3,3,5,5,5,3,4,3,3,3,5,3,5,3,5,5,3,5,3,5,5,5,5,5,3,5,3,5,3,4,5,5,5,6,5,5,5,5,4,5,3,5,3,3,5,4,3,5,3,4,5,3,5,3,5,3,1,5,1,5,3,5,5,5,3,6,3,5,3,5,2,5,5,5,1,5,5,6,5,4,5,4,3,3,3,5,3,3,3,3,5,3,3,3,3,3,3,5,5,5,4,4,4,5,5,3,5,4,5,5,4,3,3,3,4,3,5,5,4,3,3))
Do a simple regression on them and you get the following:
m <- lm((var - mean(var, na.rm = TRUE)) ~ factor - 1)
summary(m)
Call:
lm(formula = (var - mean(var, na.rm = TRUE)) ~ factor - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2040.5 -1240.2  -765.5   957.1 10932.8 

Coefficients:
        Estimate Std. Error t value Pr(>|t|)  
factor1   -82.42     800.42  -0.103   0.9181  
factor2  -732.42    1600.84  -0.458   0.6476  
factor3  -392.17     204.97  -1.913   0.0567 .
factor4   -65.19     377.32  -0.173   0.8629  
factor5   408.07     204.13   1.999   0.0465 *
factor6   303.30     855.68   0.354   0.7233  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2264 on 292 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.02677, Adjusted R-squared:  0.006774 
F-statistic: 1.339 on 6 and 292 DF,  p-value: 0.2397
It looks pretty clear that factors 3 and 5 are different from zero and different from each other, but that factor 3 is not different from factor 2, and factor 5 is not different from factor 6 (at whatever p-value you like).
How can I get this into ANOVA-table output like the example above? And is there a clean way to get it into LaTeX, ideally in a form that accommodates a lot of variables?
The following answers only the third question.
It looks like xtable does what you'd like to do: exporting R tables to $\LaTeX$ code.
There's a nice gallery as well.
I found both in a wiki post on Stack Overflow.
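A minimal sketch of the export, assuming the xtable package and the model m fitted in the question:
library(xtable)
# LaTeX code for the ANOVA table of the factor regression
print(xtable(anova(m)), type = "latex")
# or the coefficient table from the summary
print(xtable(summary(m)), type = "latex")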
