I came across some strange behaviour in my R console, although it is probably something I did rather than a bug.
When I capture the output of a model (say, a regression model as in the example below), save it to a data frame so I can edit it, and then subset the rows I am interested in, I get different results depending on the width of my console window in RStudio.
This would be a minimal working example:
> x <- c(1, 1, 0.5, 0.5, 0, 0)
> y <- c(0, 0, 0.5, 0.5, 1, 1)
>
> model <- lm(y ~ x) # Basic regression model
> output <- summary(model) # Saving the summary
> output
Call:
lm(formula = y ~ x)
Residuals:
1 2 3 4 5 6
2.725e-16 -2.477e-16 -2.484e-17 -2.484e-17 1.242e-17 1.242e-17
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.000e+00 1.195e-16 8.367e+15 <2e-16 ***
x -1.000e+00 1.852e-16 -5.401e+15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.852e-16 on 4 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.917e+31 on 1 and 4 DF, p-value: < 2.2e-16
> op2 <- capture.output(output) # Capturing the summary to make it editable
> op2 <- data.frame(op2) # Saving the editable summary as a data frame
> op2
op2
1
2 Call:
3 lm(formula = y ~ x)
4
5 Residuals:
6 1 2 3 4 5 6
7 2.725e-16 -2.477e-16 -2.484e-17 -2.484e-17 1.242e-17 1.242e-17
8
9 Coefficients:
10 Estimate Std. Error t value Pr(>|t|)
11 (Intercept) 1.000e+00 1.195e-16 8.367e+15 <2e-16 ***
12 x -1.000e+00 1.852e-16 -5.401e+15 <2e-16 ***
13 ---
14 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
15
16 Residual standard error: 1.852e-16 on 4 degrees of freedom
17 Multiple R-squared: 1,\tAdjusted R-squared: 1
18 F-statistic: 2.917e+31 on 1 and 4 DF, p-value: < 2.2e-16
19
> op3 <- op2[9:14,] # I'm only interested in rows 9 to 14 of the summary, so I subset them
> op3 # And printing them works just fine
[1] "Coefficients:"
[2] " Estimate Std. Error t value Pr(>|t|) "
[3] "(Intercept) 1.000e+00 1.195e-16 8.367e+15 <2e-16 ***"
[4] "x -1.000e+00 1.852e-16 -5.401e+15 <2e-16 ***"
[5] "---"
[6] "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1"
Now, if I reduce the width of my console (set at the top right of my RStudio window) and run the code again from op2 onwards, I get a different result, because the row numbers of op2 have changed: they seem to depend on the width of the console.
Like this:
> op2 <- capture.output(output)
> op2 <- data.frame(op2)
> op2
op2
1
2 Call:
3 lm(formula = y ~ x)
4
5 Residuals:
6 1 2 3
7 2.725e-16 -2.477e-16 -2.484e-17
8 4 5 6
9 -2.484e-17 1.242e-17 1.242e-17
10
11 Coefficients:
12 Estimate
13 (Intercept) 1.000e+00
14 x -1.000e+00
15 Std. Error
16 (Intercept) 1.195e-16
17 x 1.852e-16
18 t value Pr(>|t|)
19 (Intercept) 8.367e+15 <2e-16
20 x -5.401e+15 <2e-16
21
22 (Intercept) ***
23 x ***
24 ---
25 Signif. codes:
26 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’
27 0.05 ‘.’ 0.1 ‘ ’ 1
28
29 Residual standard error: 1.852e-16 on 4 degrees of freedom
30 Multiple R-squared: 1,\tAdjusted R-squared: 1
31 F-statistic: 2.917e+31 on 1 and 4 DF, p-value: < 2.2e-16
32
> op3 <- op2[9:14,]
> op3
[1] "-2.484e-17 1.242e-17 1.242e-17 "
[2] ""
[3] "Coefficients:"
[4] " Estimate"
[5] "(Intercept) 1.000e+00"
[6] "x -1.000e+00"
Any idea (i) why the row numbers of op2 depend on the width of my console, and (ii) how to avoid this?
Many thanks in advance.
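A minimal sketch of a workaround, assuming the wrapping is driven by R's width option (RStudio appears to update options("width") when the console pane is resized, and capture.output() inherits whatever width is current when it runs):
old <- options(width = 200)               # pin the print width so nothing wraps
op2 <- data.frame(capture.output(output))
options(old)                              # restore the previous width
coefs <- coef(output)                     # coefficient table as a numeric matrix, no text parsing needed
coefs
Pulling the numbers straight out of the summary object also avoids having to guess which captured text rows hold the coefficients.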
I am trying to see if there is a relationship between the number of bat calls and the time of the pup-rearing season. The Pup variable has three categories: "Pre", "Middle", and "Post". When I ask for the summary, it only includes the p-values for Pre and Post pup production. I created a sample data set below. With the sample data set I just get an error; with my actual data set I get the output I described above.
SAMPLE DATA SET:
Calls<- c("55","60","180","160","110","50")
Pup<-c("Pre","Middle","Post","Post","Middle","Pre")
q<-data.frame(Calls, Pup)
q
q1<-lm(Calls~Pup, data=q)
summary(q1)
OUTPUT AND ERROR MESSAGE FROM SAMPLE:
  Calls    Pup
1    55    Pre
2    60 Middle
3   180   Post
4   160   Post
5   110 Middle
6    50    Pre
Error in as.character.factor(x) : malformed factor
In addition: Warning message:
In Ops.factor(r, 2) : ‘^’ not meaningful for factors
ACTUAL INPUT FOR MY ANALYSIS:
> pupint <- lm(Calls ~ Pup, data = park2)
summary(pupint)
THIS IS THE OUTPUT I GET FROM MY ACTUAL DATA SET:
Residuals:
Min 1Q Median 3Q Max
-66.40 -37.63 -26.02 -5.39 299.93
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.54 35.82 1.858 0.0734 .
PupPost -51.98 48.50 -1.072 0.2927
PupPre -26.47 39.86 -0.664 0.5118
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 80.1 on 29 degrees of freedom
Multiple R-squared: 0.03822, Adjusted R-squared: -0.02811
F-statistic: 0.5762 on 2 and 29 DF, p-value: 0.5683
Overall, I am just wondering why the output above isn't showing "Middle". Sorry my sample data set didn't work out the same way, but maybe that error message helps in understanding the problem.
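As an aside on the sample data: the error seems to come from Calls being stored as text, so it ends up as a factor in the data frame and lm() cannot use it as a numeric response. A minimal sketch of a fix, assuming the call counts are meant to be numbers:
Calls <- c(55, 60, 180, 160, 110, 50)   # numeric, not character
Pup <- c("Pre", "Middle", "Post", "Post", "Middle", "Pre")
q <- data.frame(Calls, Pup)
q1 <- lm(Calls ~ Pup, data = q)         # now fits without the "malformed factor" error
summary(q1)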
For R to handle a dummy variable correctly, you have to indicate that Pup is a qualitative (categorical) variable by using factor:
> Pup <- factor(Pup)
> q<-data.frame(Calls, Pup)
> q1<-lm(Calls~Pup, data=q)
> summary(q1)
Call:
lm(formula = Calls ~ Pup, data = q)
Residuals:
1 2 3 4 5 6
2.5 -25.0 10.0 -10.0 25.0 -2.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.00 15.61 5.444 0.0122 *
PupPost 85.00 22.08 3.850 0.0309 *
PupPre -32.50 22.08 -1.472 0.2374
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared: 0.9097, Adjusted R-squared: 0.8494
F-statistic: 15.1 on 2 and 3 DF, p-value: 0.02716
If you want R to show all the categories of the dummy variable, then you must remove the intercept from the regression; otherwise you will fall into the dummy variable trap.
summary(lm(Calls~Pup-1, data=q))
Call:
lm(formula = Calls ~ Pup - 1, data = q)
Residuals:
1 2 3 4 5 6
2.5 -25.0 10.0 -10.0 25.0 -2.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
PupMiddle 85.00 15.61 5.444 0.01217 *
PupPost 170.00 15.61 10.889 0.00166 **
PupPre 52.50 15.61 3.363 0.04365 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared: 0.9815, Adjusted R-squared: 0.9631
F-statistic: 53.17 on 3 and 3 DF, p-value: 0.004234
If you include a categorical variable like Pup in a regression, then by default R includes a dummy variable for each level of that variable except one, the reference level. You can get a coefficient for PupMiddle if you instead omit the intercept, like this:
q1<-lm(Calls~Pup - 1, data=q)
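Another option, if the aim is simply to see an estimate for "Middle" relative to the other levels rather than to drop the intercept, is to change which level is used as the reference. A small sketch with relevel(), assuming Calls is numeric and q is built as above:
q$Pup <- relevel(factor(q$Pup), ref = "Post")  # make "Post" the reference level
summary(lm(Calls ~ Pup, data = q))             # now shows PupMiddle and PupPre rows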
I want to generate some data for doing linear regression and model selection. Here is a simple example I used, but how can I generate independent variables so that their p-values come out close to 0.05?
Actually, I'm not sure whether this question is even well posed. Thanks for any recommendations!
a=rnorm(100,mean=5,sd=2)
b=rnorm(100)
c=rnorm(100,mean=3,sd=1)
d=rnorm(100,mean=40,sd=5)
e=rnorm(100,mean=80,sd=7)
g=rnorm(100,mean=7.9,sd=0.5)
f=sample(c(0,1),100,prob=c(0.6,0.4),replace=T)
yy=2*a+0.1*b+3*c-0.6*d+0.2*e+0.9*f-2*g+rnorm(100,mean=0,sd=1)
ll=lm(yy~a+b+c+d+e+factor(f)+g)
summary(ll)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -3.05618 2.67722 -1.142 0.25660
# a 1.98623 0.05521 35.974 < 2e-16 ***
# b 0.05994 0.10657 0.562 0.57520
# c 2.98780 0.10386 28.767 < 2e-16 ***
# d -0.59633 0.01915 -31.134 < 2e-16 ***
# e 0.20678 0.01644 12.577 < 2e-16 ***
# factor(f)1 0.72422 0.24321 2.978 0.00371 **
# g -1.67970 0.25617 -6.557 3.15e-09 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 1.122 on 92 degrees of freedom
# Multiple R-squared: 0.972, Adjusted R-squared: 0.9699
# F-statistic: 456 on 7 and 92 DF, p-value: < 2.2e-16
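One way to think about it: a coefficient's p-value lands near 0.05 when its t statistic sits near the critical value (roughly 2 at these degrees of freedom), and for a roughly uncorrelated predictor the slope's standard error is about sigma / (sd(x) * sqrt(n)). So setting the true coefficient to about two standard errors tends to give p-values near 0.05, though they still vary a lot from simulation to simulation. A rough sketch with a single predictor:
set.seed(1)
n <- 100
b <- rnorm(n)                          # the predictor whose p-value should land near 0.05
sigma <- 1                             # residual standard deviation
# The standard error of b's slope is roughly sigma / (sd(b) * sqrt(n)),
# so a true slope of about two standard errors puts |t| near the 5% cutoff.
beta_b <- qt(0.975, df = n - 2) * sigma / (sd(b) * sqrt(n))
yy <- 5 + beta_b * b + rnorm(n, sd = sigma)
summary(lm(yy ~ b))                    # Pr(>|t|) for b is typically in the neighbourhood of 0.05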
Using this R linear modelling tutorial, I'm finding that the format of the model output is annoyingly different from that provided in the text, and I can't for the life of me work out why. For example, here is the code:
pitch = c(233,204,242,130,112,142)
sex = c(rep("female",3),rep("male",3))
my.df = data.frame(sex,pitch)
xmdl = lm(pitch ~ sex, my.df)
summary(xmdl)
Here is the output I get:
Call:
lm(formula = pitch ~ sex, data = my.df)
Residuals:
1 2 3 4 5 6
6.667 -22.333 15.667 2.000 -16.000 14.000
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 177.167 7.201 24.601 1.62e-05 ***
sex1 49.167 7.201 6.827 0.00241 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 17.64 on 4 degrees of freedom
Multiple R-squared: 0.921, Adjusted R-squared: 0.9012
F-statistic: 46.61 on 1 and 4 DF, p-value: 0.002407
In the tutorial, the Coefficients line has "sexmale" instead of "sex1". What setting do I need to change to achieve this?
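For what it's worth, a "sex1" label together with an intercept of 177.167 (the average of the two group means) is what sum-to-zero contrasts produce, so the session's default contrasts have probably been switched away from R's usual treatment coding. A sketch of how to check and reset them, assuming that is indeed the cause:
options("contrasts")                   # check the current default coding
# Treatment (dummy) coding is R's factory default and is what yields "sexmale";
# this resets it in case the session has been switched to sum coding.
options(contrasts = c("contr.treatment", "contr.poly"))
xmdl <- lm(pitch ~ sex, my.df)
summary(xmdl)                          # the coefficient row should now read "sexmale"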
Take a glance at the model:
> summary(left_int3)
Call:
lm(formula = Left ~ Cursor + PostCursor + CTLE + I(Cursor^2) +
I(PostCursor^2), data = QPI)
Residuals:
Min 1Q Median 3Q Max
-3585033 -845980 -58093 718849 4289610
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.455e+07 6.651e+05 21.880 < 2e-16 ***
Cursor 2.299e-06 1.302e-06 1.766 0.07754 .
PostCursor 1.150e-06 2.147e-06 0.536 0.59231
CTLE -4.772e+00 4.548e-01 -10.493 < 2e-16 ***
I(Cursor^2) -2.162e-18 6.854e-19 -3.154 0.00163 **
I(PostCursor^2) -2.775e-19 5.977e-19 -0.464 0.64253
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1175000 on 2794 degrees of freedom
Multiple R-squared: 0.3942, Adjusted R-squared: 0.3932
F-statistic: 363.7 on 5 and 2794 DF, p-value: < 2.2e-16
I'd like to know whether there are any observations with high hat values. I used the command below, with 0.04 as the criterion:
> hatvalues(left_int3) > 0.04
1 2 3 4 5 6 7 8
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Then I used this command to obtain the number of TRUE and FALSE values:
> hat<-hatvalues(left_int3)>0.04
> summary(hat)
Mode FALSE TRUE NA's
logical 2780 20 0
So I'd like to know whether there is a command that tells me the ID numbers of the observations that are TRUE.
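A minimal sketch: which() returns the positions of the TRUE entries, and because hatvalues() gives a named vector, the names carry the observation IDs:
hat <- hatvalues(left_int3)
which(hat > 0.04)                      # named vector: positions and row labels of the flagged cases
names(which(hat > 0.04))               # just the ID labels
sum(hat > 0.04)                        # count of TRUE values (should match the 20 above)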
I understand contrasts from previous posts, and I think I am doing the right thing, but it is not giving me what I would expect.
x <- c(11.80856, 11.89269, 11.42944, 12.03155, 10.40744,
12.48229, 12.1188, 11.76914, 0, 0,
13.65773, 13.83269, 13.2401, 14.54421, 13.40312)
type <- factor(c(rep("g",5),rep("i",5),rep("t",5)))
type
[1] g g g g g i i i i i t t t t t
Levels: g i t
When I run this:
> summary.lm(aov(x ~ type))
Call:
aov(formula = x ~ type)
Residuals:
Min 1Q Median 3Q Max
-7.2740 -0.4140 0.0971 0.6631 5.2082
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.514 1.729 6.659 2.33e-05 ***
typei -4.240 2.445 -1.734 0.109
typet 2.222 2.445 0.909 0.381
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.866 on 12 degrees of freedom
Multiple R-squared: 0.3753, Adjusted R-squared: 0.2712
F-statistic: 3.605 on 2 and 12 DF, p-value: 0.05943
Here my reference level is type "g", so typei is the difference between type "i" and type "g", and typet is the difference between type "t" and type "g".
I wanted to see two more contrasts here: the difference between types "g" and "i" combined and type "t", and the difference between type "i" and type "t".
So I set the contrasts:
> contrasts(type) <- cbind( c(-1,-1,2),c(0,-1,1))
> summary.lm(aov(x~type))
Call:
aov(formula = x ~ type)
Residuals:
Min 1Q Median 3Q Max
-7.2740 -0.4140 0.0971 0.6631 5.2082
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.8412 0.9983 10.860 1.46e-07 ***
type1 -0.6728 1.4118 -0.477 0.642
type2 4.2399 2.4453 1.734 0.109
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.866 on 12 degrees of freedom
Multiple R-squared: 0.3753, Adjusted R-squared: 0.2712
F-statistic: 3.605 on 2 and 12 DF, p-value: 0.05943
When I try to do the second contrast by changing my reference, I get different results, and I don't understand what is wrong with my contrast.
Reference: http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm
mat <- cbind(rep(1/3, 3), "g+i vs t"=c(-1/2, -1/2, 1),"i vs t"=c(0, -1, 1))
mymat <- solve(t(mat))
my.contrast <- mymat[,2:3]
contrasts(type) <- my.contrast
summary.lm(aov(x ~ type))
my.contrast
     g+i vs t i vs t
[1,]  -1.3333      1
[2,]   0.6667     -1
[3,]   0.6667      0
> contrasts(type) <- my.contrast
> summary.lm(aov(x ~ type))
Call:
aov(formula = x ~ type)
Residuals:
Min 1Q Median 3Q Max
-7.274 -0.414 0.097 0.663 5.208
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.841 0.998 10.86 1.5e-07 ***
typeg+i vs t 4.342 2.118 2.05 0.063 .
typei vs t 6.462 2.445 2.64 0.021 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.87 on 12 degrees of freedom
Multiple R-squared: 0.375, Adjusted R-squared: 0.271
F-statistic: 3.6 on 2 and 12 DF, p-value: 0.0594
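As a sanity check on which version estimates the intended comparisons: when a contrast matrix is assigned directly, the coefficients correspond (as I understand it) to the generalised inverse of the coding matrix rather than to its columns, which is why the UCLA approach inverts t(mat) before assigning. The coefficients from the last model can be reproduced from the raw group means; a small check along these lines:
means <- tapply(x, type, mean)           # group means for g, i and t
means["t"] - mean(means[c("g", "i")])    # about 4.342, matching the "g+i vs t" coefficient
means["t"] - means["i"]                  # about 6.462, matching the "i vs t" coefficient
The first attempt's type1 and type2 estimates do not line up with these quantities, which suggests the directly assigned cbind(c(-1,-1,2), c(0,-1,1)) matrix is not testing the comparisons it was meant to encode.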