Take a glance at the model
> summary(left_int3)
Call:
lm(formula = Left ~ Cursor + PostCursor + CTLE + I(Cursor^2) +
I(PostCursor^2), data = QPI)
Residuals:
Min 1Q Median 3Q Max
-3585033 -845980 -58093 718849 4289610
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.455e+07 6.651e+05 21.880 < 2e-16 ***
Cursor 2.299e-06 1.302e-06 1.766 0.07754 .
PostCursor 1.150e-06 2.147e-06 0.536 0.59231
CTLE -4.772e+00 4.548e-01 -10.493 < 2e-16 ***
I(Cursor^2) -2.162e-18 6.854e-19 -3.154 0.00163 **
I(PostCursor^2) -2.775e-19 5.977e-19 -0.464 0.64253
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1175000 on 2794 degrees of freedom
Multiple R-squared: 0.3942, Adjusted R-squared: 0.3932
F-statistic: 363.7 on 5 and 2794 DF, p-value: < 2.2e-16
I'd like to know whether any observations have high hat values, so I used the command below, with 0.04 as the criterion:
> hatvalues(left_int3) > 0.04
1 2 3 4 5 6 7 8
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Then I used this command to obtain the number of TRUE and FALSE values:
> hat<-hatvalues(left_int3)>0.04
> summary(hat)
Mode FALSE TRUE NA's
logical 2780 20 0
So I'd like to know: is there a command that tells me the ID numbers of the observations that are TRUE?
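A minimal sketch of one way to get them (assuming left_int3 is the model fitted above): which() returns the positions of the TRUE entries of a logical vector, keeping their names, which here are the observation IDs.
hat <- hatvalues(left_int3) > 0.04
which(hat)          # positions and row IDs of the 20 high-leverage observations
names(which(hat))   # just the ID labels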
I came across some strange behavior in my R console, although it is probably me making it do something strange rather than a bug.
When I capture the output of a model (say, a regression summary as in the example below), save it to a data frame to make it editable, and then subset the rows I am interested in, I get different results depending on the width of my console window in RStudio.
This would be a minimal working example:
> x <- c(1, 1, 0.5, 0.5, 0, 0)
> y <- c(0, 0, 0.5, 0.5, 1, 1)
>
> model <- lm(y ~ x) # Basic regression model
> output <- summary(model) # Saving the summary
> output
Call:
lm(formula = y ~ x)
Residuals:
1 2 3 4 5 6
2.725e-16 -2.477e-16 -2.484e-17 -2.484e-17 1.242e-17 1.242e-17
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.000e+00 1.195e-16 8.367e+15 <2e-16 ***
x -1.000e+00 1.852e-16 -5.401e+15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.852e-16 on 4 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.917e+31 on 1 and 4 DF, p-value: < 2.2e-16
> op2 <- capture.output(output) # Capturing the summary to make it editable
> op2 <- data.frame(op2) # Saving the editable summary as a data frame
> op2
op2
1
2 Call:
3 lm(formula = y ~ x)
4
5 Residuals:
6 1 2 3 4 5 6
7 2.725e-16 -2.477e-16 -2.484e-17 -2.484e-17 1.242e-17 1.242e-17
8
9 Coefficients:
10 Estimate Std. Error t value Pr(>|t|)
11 (Intercept) 1.000e+00 1.195e-16 8.367e+15 <2e-16 ***
12 x -1.000e+00 1.852e-16 -5.401e+15 <2e-16 ***
13 ---
14 Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
15
16 Residual standard error: 1.852e-16 on 4 degrees of freedom
17 Multiple R-squared: 1,\tAdjusted R-squared: 1
18 F-statistic: 2.917e+31 on 1 and 4 DF, p-value: < 2.2e-16
19
> op3 <- op2[9:14,] # I'm only interested in rows 9 to 14 of the summary, so I subset them
> op3 # And printing them works just fine
[1] "Coefficients:"
[2] " Estimate Std. Error t value Pr(>|t|) "
[3] "(Intercept) 1.000e+00 1.195e-16 8.367e+15 <2e-16 ***"
[4] "x -1.000e+00 1.852e-16 -5.401e+15 <2e-16 ***"
[5] "---"
[6] "Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1"
Now, if I reduce the width of my console (set at the top right of my RStudio) and run the code again from op2 onwards, I get a different result, because the row numbers of op2 have changed: they seem to depend on the width of the console.
Like this:
> op2 <- capture.output(output)
> op2 <- data.frame(op2)
> op2
op2
1
2 Call:
3 lm(formula = y ~ x)
4
5 Residuals:
6 1 2 3
7 2.725e-16 -2.477e-16 -2.484e-17
8 4 5 6
9 -2.484e-17 1.242e-17 1.242e-17
10
11 Coefficients:
12 Estimate
13 (Intercept) 1.000e+00
14 x -1.000e+00
15 Std. Error
16 (Intercept) 1.195e-16
17 x 1.852e-16
18 t value Pr(>|t|)
19 (Intercept) 8.367e+15 <2e-16
20 x -5.401e+15 <2e-16
21
22 (Intercept) ***
23 x ***
24 ---
25 Signif. codes:
26 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’
27 0.05 ‘.’ 0.1 ‘ ’ 1
28
29 Residual standard error: 1.852e-16 on 4 degrees of freedom
30 Multiple R-squared: 1,\tAdjusted R-squared: 1
31 F-statistic: 2.917e+31 on 1 and 4 DF, p-value: < 2.2e-16
32
> op3 <- op2[9:14,]
> op3
[1] "-2.484e-17 1.242e-17 1.242e-17 "
[2] ""
[3] "Coefficients:"
[4] " Estimate"
[5] "(Intercept) 1.000e+00"
[6] "x -1.000e+00"
Any idea (i) why the row numbers of op2 depend on the width of my console, and (ii) how to avoid this?
Many thanks in advance.
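A minimal sketch of two workarounds, assuming the goal is just the coefficient rows. print() wraps its output at getOption("width"), which RStudio keeps in sync with the console pane, and capture.output() records whatever print() produces, so the number of captured lines depends on the width at the moment you run it. Either pin the width or skip text capture entirely:
# pin the width so capture.output() always wraps the same way
old <- options(width = 200)
op2 <- capture.output(output)
options(old)
# better: skip parsing printed text, the coefficient table is already a matrix
coef(summary(model))   # columns: Estimate, Std. Error, t value, Pr(>|t|)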
I am trying to see if there is a relationship between the number of bat calls and the time of pup rearing season. The Pup variable has three categories: "Pre", "Middle", and "Post". When I ask for the summary, it only includes the p-values for Pre and Post pup production. I created a sample data set below. With the sample data set I just get an error; with my actual data set I get the output I described above.
SAMPLE DATA SET:
Calls<- c("55","60","180","160","110","50")
Pup<-c("Pre","Middle","Post","Post","Middle","Pre")
q<-data.frame(Calls, Pup)
q
q1<-lm(Calls~Pup, data=q)
summary(q1)
OUTPUT AND ERROR MESSAGE FROM SAMPLE:
> q
  Calls    Pup
1 55 Pre
2 60 Middle
3 180 Post
4 160 Post
5 110 Middle
6 50 Pre
Error in as.character.factor(x) : malformed factor
In addition: Warning message:
In Ops.factor(r, 2) : ‘^’ not meaningful for factors
ACTUAL INPUT FOR MY ANALYSIS:
> pupint <- lm(Calls ~ Pup, data = park2)
summary(pupint)
THIS IS THE OUTPUT I GET FROM MY ACTUAL DATA SET:
Residuals:
Min 1Q Median 3Q Max
-66.40 -37.63 -26.02 -5.39 299.93
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.54 35.82 1.858 0.0734 .
PupPost -51.98 48.50 -1.072 0.2927
PupPre -26.47 39.86 -0.664 0.5118
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 80.1 on 29 degrees of freedom
Multiple R-squared: 0.03822, Adjusted R-squared: -0.02811
F-statistic: 0.5762 on 2 and 29 DF, p-value: 0.5683
Overall, I am just wondering why the above output isn't showing "Middle". Sorry my sample data set didn't work out the same, but maybe that error message will help in understanding the problem.
For R to handle the dummy coding correctly, you have to indicate that Pup is a qualitative (categorical) variable by using factor. (The "malformed factor" error in the sample comes from Calls being built from quoted strings, so it also needs to be converted to numeric.)
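A sketch of that extra fix for the sample data (Calls was defined with quoted strings above):
Calls <- as.numeric(Calls)   # "55", "60", ... become 55, 60, ...
Then declare Pup a factor: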
> Pup <- factor(Pup)
> q<-data.frame(Calls, Pup)
> q1<-lm(Calls~Pup, data=q)
> summary(q1)
Call:
lm(formula = Calls ~ Pup, data = q)
Residuals:
1 2 3 4 5 6
2.5 -25.0 10.0 -10.0 25.0 -2.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 85.00 15.61 5.444 0.0122 *
PupPost 85.00 22.08 3.850 0.0309 *
PupPre -32.50 22.08 -1.472 0.2374
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared: 0.9097, Adjusted R-squared: 0.8494
F-statistic: 15.1 on 2 and 3 DF, p-value: 0.02716
If you want R to show all categories of the dummy variable, then you must remove the intercept from the regression; otherwise you will fall into the dummy variable trap.
summary(lm(Calls~Pup-1, data=q))
Call:
lm(formula = Calls ~ Pup - 1, data = q)
Residuals:
1 2 3 4 5 6
2.5 -25.0 10.0 -10.0 25.0 -2.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
PupMiddle 85.00 15.61 5.444 0.01217 *
PupPost 170.00 15.61 10.889 0.00166 **
PupPre 52.50 15.61 3.363 0.04365 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 22.08 on 3 degrees of freedom
Multiple R-squared: 0.9815, Adjusted R-squared: 0.9631
F-statistic: 53.17 on 3 and 3 DF, p-value: 0.004234
If you include a categorical variable like Pup in a regression, R includes a dummy variable for each level of that variable except one (the reference level) by default. You can get a coefficient for PupMiddle if you instead omit the intercept, like this:
q1<-lm(Calls~Pup - 1, data=q)
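If it helps to see what R builds under the hood, you can inspect the dummy columns directly (a quick illustration using the q data frame from above):
model.matrix(~ Pup, data = q)       # intercept + PupPost + PupPre; Middle is the reference
model.matrix(~ Pup - 1, data = q)   # one column per level: PupMiddle, PupPost, PupPre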
I want to generate some data to do linear regression and model selection. Here is a simple example I used, but how can I generate independent variables whose p-values come out close to 0.05?
Actually I'm not sure whether this question is well posed. Thanks for any recommendations!
a=rnorm(100,mean=5,sd=2)
b=rnorm(100)
c=rnorm(100,mean=3,sd=1)
d=rnorm(100,mean=40,sd=5)
e=rnorm(100,mean=80,sd=7)
g=rnorm(100,mean=7.9,sd=0.5)
f=sample(c(0,1),100,prob=c(0.6,0.4),replace=T)
yy=2*a+0.1*b+3*c-0.6*d+0.2*e+0.9*f-2*g+rnorm(100,mean=0,sd=1)
ll=lm(yy~a+b+c+d+e+factor(f)+g)
summary(ll)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -3.05618 2.67722 -1.142 0.25660
# a 1.98623 0.05521 35.974 < 2e-16 ***
# b 0.05994 0.10657 0.562 0.57520
# c 2.98780 0.10386 28.767 < 2e-16 ***
# d -0.59633 0.01915 -31.134 < 2e-16 ***
# e 0.20678 0.01644 12.577 < 2e-16 ***
# factor(f)1 0.72422 0.24321 2.978 0.00371 **
# g -1.67970 0.25617 -6.557 3.15e-09 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 1.122 on 92 degrees of freedom
# Multiple R-squared: 0.972, Adjusted R-squared: 0.9699
# F-statistic: 456 on 7 and 92 DF, p-value: < 2.2e-16
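A minimal sketch of one way to aim for this, under my own assumptions rather than anything canonical: for a single predictor, the standard error of the slope is sigma / sqrt(sum((x - mean(x))^2)), so sizing the true slope at the 5% critical t value times that standard error puts the expected t statistic right at the significance boundary. The realized p-value will still bounce around 0.05 from run to run.
set.seed(1)
n     <- 100
x     <- rnorm(n)
sigma <- 1
se    <- sigma / sqrt(sum((x - mean(x))^2))   # theoretical SE of the slope estimate
beta  <- qt(0.975, df = n - 2) * se           # slope sized so E[t] sits near the cutoff
y     <- beta * x + rnorm(n, sd = sigma)
summary(lm(y ~ x))                            # slope p-value should land near 0.05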
For more involved algorithms, determining the time complexity (i.e. Big-O) analytically is a pain. My approach has been to time the execution of the algorithm over a range of the parameters n and k, and come up with a function (a time function) of n and k.
My data looks something like the below:
n k executionTime
500 1 0.02
500 2 0.03
500 3 0.05
500 ... ...
500 10 0.18
1000 1 0.08
... ... ...
10000 1 9.8
... ... ...
10000 10 74.57
I've been using the lm() function from R's stats package, but I don't know how to interpret the output of the multiple regression to determine a final Big-O. This is my main question: how do you translate the output of a multiple-variable regression into a final ruling on the best Big-O time complexity rating?
Here's the output of the lm():
Residuals:
Min 1Q Median 3Q Max
-14.943 -5.325 -1.916 3.681 31.475
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.130e+01 1.591e+00 -13.39 <2e-16 ***
n 4.080e-03 1.953e-04 20.89 <2e-16 ***
k 2.361e+00 1.960e-01 12.05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.962 on 197 degrees of freedom
Multiple R-squared: 0.747, Adjusted R-squared: 0.7444
F-statistic: 290.8 on 2 and 197 DF, p-value: < 2.2e-16
Here's the output of log(y) ~ log(n) + log(k):
Residuals:
Min 1Q Median 3Q Max
-0.4445 -0.1136 -0.0253 0.1370 0.5007
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.80405 0.13749 -122.22 <2e-16 ***
log(n) 2.02321 0.01609 125.72 <2e-16 ***
log(k) 1.01216 0.01833 55.22 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1803 on 197 degrees of freedom
Multiple R-squared: 0.9897, Adjusted R-squared: 0.9896
F-statistic: 9428 on 2 and 197 DF, p-value: < 2.2e-16
Here's the output of the principal components, showing that both n and k contribute to the spread of the multivariate model:
                         PC1 (n)   PC2 (k)   PC3 (noise?)
Standard deviation        1.3654    1.0000        0.36840
Proportion of Variance    0.6214    0.3333        0.04524
Cumulative Proportion     0.6214    0.9548        1.00000
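For what it's worth, here is how I would read the log-log fit (my own interpretation, with a hypothetical data frame name): if log(time) = a + b*log(n) + c*log(k), then time is approximately exp(a) * n^b * k^c, so the fitted exponents are the complexity estimates.
# assuming the timings live in a data frame called `timings`
fit <- lm(log(executionTime) ~ log(n) + log(k), data = timings)
coef(fit)
# with b ~ 2.02 and c ~ 1.01 as in the output above, that points to roughly O(n^2 * k)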
I understand the contrasts from previous posts, and I think I am doing the right thing, but it is not giving me what I would expect.
x <- c(11.80856, 11.89269, 11.42944, 12.03155, 10.40744,
12.48229, 12.1188, 11.76914, 0, 0,
13.65773, 13.83269, 13.2401, 14.54421, 13.40312)
type <- factor(c(rep("g",5),rep("i",5),rep("t",5)))
type
[1] g g g g g i i i i i t t t t t
Levels: g i t
When I run this:
> summary.lm(aov(x ~ type))
Call:
aov(formula = x ~ type)
Residuals:
Min 1Q Median 3Q Max
-7.2740 -0.4140 0.0971 0.6631 5.2082
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.514 1.729 6.659 2.33e-05 ***
typei -4.240 2.445 -1.734 0.109
typet 2.222 2.445 0.909 0.381
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.866 on 12 degrees of freedom
Multiple R-squared: 0.3753, Adjusted R-squared: 0.2712
F-statistic: 3.605 on 2 and 12 DF, p-value: 0.05943
Here my reference is type "g", so typei is the difference between type "g" and type "i", and typet is the difference between type "g" and type "t".
I wanted to see two more contrasts here: the difference between the combined "g" and "i" groups and type "t", and the difference between type "i" and type "t".
So I set the contrasts:
> contrasts(type) <- cbind( c(-1,-1,2),c(0,-1,1))
> summary.lm(aov(x~type))
Call:
aov(formula = x ~ type)
Residuals:
Min 1Q Median 3Q Max
-7.2740 -0.4140 0.0971 0.6631 5.2082
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.8412 0.9983 10.860 1.46e-07 ***
type1 -0.6728 1.4118 -0.477 0.642
type2 4.2399 2.4453 1.734 0.109
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.866 on 12 degrees of freedom
Multiple R-squared: 0.3753, Adjusted R-squared: 0.2712
F-statistic: 3.605 on 2 and 12 DF, p-value: 0.05943
When I try to get the second contrast by changing my reference, I get different results. I am not understanding what is wrong with my contrast.
Reference: http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm
The matrix you assign with contrasts() is the coding matrix, not the matrix of comparisons you want tested. To get coefficients that estimate your comparisons directly, build the full hypothesis matrix (with a column of 1/3s for the intercept), invert its transpose, and use the remaining columns as the contrasts:
mat <- cbind(rep(1/3, 3), "g+i vs t" = c(-1/2, -1/2, 1), "i vs t" = c(0, -1, 1))
mymat <- solve(t(mat))
my.contrast <- mymat[,2:3]
contrasts(type) <- my.contrast
summary.lm(aov(x ~ type))
my.contrast
> my.contrast
     g+i vs t i vs t
[1,] -1.3333 1
[2,] 0.6667 -1
[3,] 0.6667 0
> contrasts(type) <- my.contrast
> summary.lm(aov(x ~ type))
Call:
aov(formula = x ~ type)
Residuals:
Min 1Q Median 3Q Max
-7.274 -0.414 0.097 0.663 5.208
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.841 0.998 10.86 1.5e-07 ***
typeg+i vs t 4.342 2.118 2.05 0.063 .
typei vs t 6.462 2.445 2.64 0.021 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.87 on 12 degrees of freedom
Multiple R-squared: 0.375, Adjusted R-squared: 0.271
F-statistic: 3.6 on 2 and 12 DF, p-value: 0.0594
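As a quick check (my own addition, using the data above), the coefficients now estimate the group-mean comparisons directly:
m <- tapply(x, type, mean)        # group means for g, i, t
m["t"] - (m["g"] + m["i"]) / 2    # the "g+i vs t" estimate, ~4.342
m["t"] - m["i"]                   # the "i vs t" estimate,   ~6.462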