How to generate data for significance testing? - r

I want to generate some data to do linear regression and model selection. Here is a simple example I used, but how can I generate independent variables whose p-values come out close to 0.05?
Actually, I'm not sure whether this question even makes sense. Thanks for any recommendations!
# simulate six continuous predictors and one binary predictor
a <- rnorm(100, mean = 5, sd = 2)
b <- rnorm(100)
c <- rnorm(100, mean = 3, sd = 1)
d <- rnorm(100, mean = 40, sd = 5)
e <- rnorm(100, mean = 80, sd = 7)
g <- rnorm(100, mean = 7.9, sd = 0.5)
f <- sample(c(0, 1), 100, prob = c(0.6, 0.4), replace = TRUE)
# response with known coefficients plus N(0, 1) noise
yy <- 2*a + 0.1*b + 3*c - 0.6*d + 0.2*e + 0.9*f - 2*g + rnorm(100, mean = 0, sd = 1)
ll <- lm(yy ~ a + b + c + d + e + factor(f) + g)
summary(ll)
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -3.05618 2.67722 -1.142 0.25660
# a 1.98623 0.05521 35.974 < 2e-16 ***
# b 0.05994 0.10657 0.562 0.57520
# c 2.98780 0.10386 28.767 < 2e-16 ***
# d -0.59633 0.01915 -31.134 < 2e-16 ***
# e 0.20678 0.01644 12.577 < 2e-16 ***
# factor(f)1 0.72422 0.24321 2.978 0.00371 **
# g -1.67970 0.25617 -6.557 3.15e-09 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 1.122 on 92 degrees of freedom
# Multiple R-squared: 0.972, Adjusted R-squared: 0.9699
# F-statistic: 456 on 7 and 92 DF, p-value: < 2.2e-16
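
For reference, here is a minimal sketch of one way to aim a predictor's p-value at 0.05 (my own illustration, under the simplifying assumption of roughly uncorrelated predictors): since SE(beta-hat) for a predictor x is approximately sigma / (sd(x) * sqrt(n)), setting beta = 1.96 * sigma / (sd(x) * sqrt(n)) puts the expected t-statistic near 1.96 and hence the p-value near 0.05. Any single simulated data set will still scatter around that target.
# sketch: choose the coefficient so the expected t-statistic is about 1.96
set.seed(1)
n     <- 100
sigma <- 1                                  # residual sd
x     <- rnorm(n)                           # sd(x) is about 1
beta  <- 1.96 * sigma / (sd(x) * sqrt(n))   # targets t ~ 1.96, i.e. p ~ 0.05
y     <- beta * x + rnorm(n, sd = sigma)
summary(lm(y ~ x))$coefficients             # Pr(>|t|) for x lands near 0.05 on average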

Related

Why does the use of log generate missing values (NA) in regression coefficients?

I have 4 series: CO2 emissions, GDP per capita, GDP per capita squared, and number of tourist arrivals. I am trying to run a model to observe the impact of tourist arrivals on CO2 emissions, in order to derive the tourism-induced Environmental Kuznets Curve. Below are the code and summary results.
Without log
Yt  <- Data$`CO2 emissions`
X1t <- Data$`GDP per capita`
X2t <- Data$`GDP per caita square`
X3t <- Data$`Number of Tourist arrival`
model <- lm(Yt ~ X1t + X2t + X3t)
summary(model)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.238e-02 7.395e-03 1.674 0.100187
X1t -2.581e-05 6.710e-05 -0.385 0.702139
X2t 1.728e-07 4.572e-08 3.780 0.000413 ***
X3t 1.928e-07 3.501e-08 5.507 1.2e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.02252 on 51 degrees of freedom
Multiple R-squared: 0.9475, Adjusted R-squared: 0.9444
F-statistic: 306.5 on 3 and 51 DF, p-value: < 2.2e-16
With log
LYt  <- log(Yt)
LX1t <- log(X1t)
LX2t <- log(X2t)
LX3t <- log(X3t)
model1 <- lm(LYt ~ LX1t + LX2t + LX3t)
summary(model1)
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.38623 0.46040 -20.387 < 2e-16 ***
LX1t 0.83679 0.09834 8.509 2.01e-11 ***
LX2t NA NA NA NA
LX3t 0.17802 0.06888 2.585 0.0126 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2863 on 52 degrees of freedom
Multiple R-squared: 0.9074, Adjusted R-squared: 0.9038
F-statistic: 254.7 on 2 and 52 DF, p-value: < 2.2e-16
It is pretty evident that GDP per capita and GDP per capita squared are perfectly collinear. However, why do the regression coefficients show missing values (NA) only in the log-transformed model?
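
For what it's worth, the effect is easy to reproduce with simulated data (a minimal sketch, not the questioner's data set): x and x^2 are not linearly dependent, but log(x^2) = 2 * log(x) is an exact linear function of log(x), so lm() only finds a singularity after the log transform.
# sketch: the exact linear dependence appears only on the log scale
set.seed(1)
x <- runif(50, min = 1, max = 100)
y <- rnorm(50)
coef(lm(y ~ x + I(x^2)))   # both terms estimated: x^2 is not a linear function of x
lx  <- log(x)
lx2 <- log(x^2)            # identical to 2 * lx, an exact linear dependence
coef(lm(y ~ lx + lx2))     # lx2 comes back NA, dropped as singular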

summary(lm): "essentially perfect fit: summary may be unreliable" warning for standardized data

I am a beginner in R and statistics in general. I am trying to build a regression model with 4 variables (one of them nominal, with 3 categories). I think I managed to build a model with my raw data, but when I standardized my data set, lm gave the "essentially perfect fit: summary may be unreliable" warning.
This is the summary of the model with raw data:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.0713 5.9131 -0.350 0.727
Vdownload 8.6046 0.5286 16.279 < 2e-16 ***
DownloadDegisim 2.8854 0.6822 4.229 4.25e-05 ***
Vupload -4.2877 0.5418 -7.914 7.32e-13 ***
Saglayici2 -8.2084 0.6043 -13.583 < 2e-16 ***
Saglayici3 -9.8869 0.5944 -16.634 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.885 on 138 degrees of freedom
Multiple R-squared: 0.8993, Adjusted R-squared: 0.8956
F-statistic: 246.5 on 5 and 138 DF, p-value: < 2.2e-16
I wrote this code to standardize my data:
memnuniyet_scaled <-scale(Vdownload, center = TRUE, scale = TRUE)
Vdownload_scaled <-scale(Vdownload, center = TRUE, scale = TRUE)
Vupload_scaled <-scale(Vupload, center = TRUE, scale = TRUE)
DownloadD_scaled <- scale(DownloadDegisim, center = TRUE, scale = TRUE)
result<-lm(memnuniyet_scaled~Vdownload_scaled+DownloadD_scaled+Vupload_scaled+Saglayıcı)
summary(result)
And this is the summary of the standardized model:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.079e-17 5.493e-17 9.250e-01 0.357
Vdownload_scaled 1.000e+00 6.667e-17 1.500e+16 <2e-16 ***
DownloadD_scaled -4.591e-17 8.189e-17 -5.610e-01 0.576
Vupload_scaled 9.476e-18 6.337e-17 1.500e-01 0.881
Saglayici2 -6.523e-17 7.854e-17 -8.300e-01 0.408
Saglayici3 -8.669e-17 7.725e-17 -1.122e+00 0.264
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.75e-16 on 138 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 2.034e+32 on 5 and 138 DF, p-value: < 2.2e-16
I do know that the R-squared value should not change with standardization, and I have no idea what I did wrong.
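
As a quick check of that expectation (a minimal sketch on the built-in mtcars data, not the questioner's data set): standardizing the response and the predictors is a linear transformation of each variable, so R-squared is unchanged when everything is scaled consistently.
# sketch: R-squared is invariant to standardizing both y and x
raw    <- lm(mpg ~ wt + hp, data = mtcars)
scaled <- lm(scale(mpg) ~ scale(wt) + scale(hp), data = mtcars)
c(summary(raw)$r.squared, summary(scaled)$r.squared)   # the two values match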

Separate output when results='hold' in knitr

Is there a way to separate or section the output in a {r results='hold'} block in knitr, without manually inserting a print('--------') command or the like in between? I want to avoid such extra print lines because they make the code much harder to read.
For example, you need a quite trained eye to see where the output from one statement stops and the next begins in the output from this:
```{r, results='hold'}
summary(lm(Sepal.Length ~ Species, iris))
#print('-------------') # Not a solution
summary(aov(Sepal.Length ~ Species, iris))
```
Output:
Call:
lm(formula = Sepal.Length ~ Species, data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.6880 -0.3285 -0.0060 0.3120 1.3120
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.0060 0.0728 68.762 < 2e-16 ***
Speciesversicolor 0.9300 0.1030 9.033 8.77e-16 ***
Speciesvirginica 1.5820 0.1030 15.366 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5148 on 147 degrees of freedom
Multiple R-squared: 0.6187, Adjusted R-squared: 0.6135
F-statistic: 119.3 on 2 and 147 DF, p-value: < 2.2e-16
Df Sum Sq Mean Sq F value Pr(>F)
Species 2 63.21 31.606 119.3 <2e-16 ***
Residuals 147 38.96 0.265
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Maybe some knitr option or a more general R printing option?
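
For comparison, the default chunk option results='markup' already keeps the outputs apart, because each statement's output is printed right after its echoed source; a minimal sketch of that alternative (it gives up the hold-everything-to-the-end layout, which may or may not be acceptable):
```{r, results='markup'}
summary(lm(Sepal.Length ~ Species, iris))
summary(aov(Sepal.Length ~ Species, iris))
```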

Big-O of Algorithm Using Multiple Variable Regression

For more involved algorithms, determining the time complexity (i.e. Big-O) is a pain. My solution has been to time the execution of the algorithm with parameters n and k, and to come up with a function (a time function) that varies with n and k.
My data looks something like this:
     n    k   executionTime
   500    1            0.02
   500    2            0.03
   500    3            0.05
   500  ...             ...
   500   10            0.18
  1000    1            0.08
   ...  ...             ...
 10000    1             9.8
   ...  ...             ...
 10000   10           74.57
I've been using the lm() function from R's stats package. I don't know how to interpret the output of the multiple regression to determine a final Big-O. This is my main question: how do you translate the output of a multiple-variable regression into a final ruling on the best Big-O time-complexity rating?
Here's the output of the lm():
Residuals:
Min 1Q Median 3Q Max
-14.943 -5.325 -1.916 3.681 31.475
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.130e+01 1.591e+00 -13.39 <2e-16 ***
n 4.080e-03 1.953e-04 20.89 <2e-16 ***
k 2.361e+00 1.960e-01 12.05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.962 on 197 degrees of freedom
Multiple R-squared: 0.747, Adjusted R-squared: 0.7444
F-statistic: 290.8 on 2 and 197 DF, p-value: < 2.2e-16
Here's the output of log(y) ~ log(n) + log(k):
Residuals:
Min 1Q Median 3Q Max
-0.4445 -0.1136 -0.0253 0.1370 0.5007
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.80405 0.13749 -122.22 <2e-16 ***
log(n) 2.02321 0.01609 125.72 <2e-16 ***
log(k) 1.01216 0.01833 55.22 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1803 on 197 degrees of freedom
Multiple R-squared: 0.9897, Adjusted R-squared: 0.9896
F-statistic: 9428 on 2 and 197 DF, p-value: < 2.2e-16
Here's the output of the principal components, showing that both n and k contribute to the spread of the multivariate model:
                        PC1 (this is n)   PC2 (this is k)   PC3 (noise?)
Standard deviation               1.3654            1.0000        0.36840
Proportion of Variance           0.6214            0.3333        0.04524
Cumulative Proportion            0.6214            0.9548        1.00000
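
To make the reading-off step concrete, here is a minimal sketch with simulated timings (my own illustration, not the questioner's data): on the log-log scale the fitted slopes estimate the exponents, so slopes near 2 for log(n) and 1 for log(k), as in the output above, suggest the time grows like n^2 * k, i.e. O(n^2 * k).
# sketch: recover exponents from a log-log fit on simulated timings
set.seed(1)
d      <- expand.grid(n = c(500, 1000, 2000, 5000, 10000), k = 1:10)
d$time <- 1e-8 * d$n^2 * d$k * exp(rnorm(nrow(d), sd = 0.1))   # true law: n^2 * k
fit    <- lm(log(time) ~ log(n) + log(k), data = d)
round(coef(fit), 2)   # slopes near 2 and 1 recover the n^2 * k growth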

Specify the ID of high-leverage points

Take a glance at the model
> summary(left_int3)
Call:
lm(formula = Left ~ Cursor + PostCursor + CTLE + I(Cursor^2) +
I(PostCursor^2), data = QPI)
Residuals:
Min 1Q Median 3Q Max
-3585033 -845980 -58093 718849 4289610
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.455e+07 6.651e+05 21.880 < 2e-16 ***
Cursor 2.299e-06 1.302e-06 1.766 0.07754 .
PostCursor 1.150e-06 2.147e-06 0.536 0.59231
CTLE -4.772e+00 4.548e-01 -10.493 < 2e-16 ***
I(Cursor^2) -2.162e-18 6.854e-19 -3.154 0.00163 **
I(PostCursor^2) -2.775e-19 5.977e-19 -0.464 0.64253
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1175000 on 2794 degrees of freedom
Multiple R-squared: 0.3942, Adjusted R-squared: 0.3932
F-statistic: 363.7 on 5 and 2794 DF, p-value: < 2.2e-16
And I'd like to know whether any observations have high hat values. I used the command below, with 0.04 as the criterion:
> hatvalues(left_int3) > 0.04
1 2 3 4 5 6 7 8
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
And I used this command to obtain the counts of TRUE and FALSE values:
> hat<-hatvalues(left_int3)>0.04
> summary(hat)
Mode FALSE TRUE NA's
logical 2780 20 0
So I'd like to know whether there is a command that returns the ID numbers of the observations that are TRUE.
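
For reference, base R's which() does this directly: it returns the positions of the TRUE entries and keeps the observation IDs as names. A minimal sketch using the question's model object left_int3:
high <- which(hatvalues(left_int3) > 0.04)
high          # named integer vector; the names are the observation IDs
names(high)   # just the IDs, as character strings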
