I have to run a simulation with R.
Basically I have to create a matrix of variables X of dimension n×p (n = 25, p = 24), where the first column is constant and equal to 1 and the other 23 columns contain values drawn from a N(0,1).
A vector beta of length 24 whose first 2 values are 1 and the rest are 0.
A vector epsilon of length 25 (one entry per observation) drawn from a N(0,1).
After that I create a variable y as y = X %*% beta + epsilon.
Then I select the column Xj of X, with j going from 3 to 24, that has the maximum abs(cor(Xj, y)), and I run the model y ~ X1 + X2 + Xj and look at the results.
set.seed(123)

n <- 25
p <- 24
b <- seq(1, 1000)

X <- cbind(matrix(1, nrow = 25, ncol = 1),
           matrix(rnorm(25 * 23), nrow = 25, ncol = 23))
beta <- t(t(c(1, 1, rep(0, 22))))
eps <- t(t(rnorm(25)))

y <- X %*% beta + eps

j <- seq(3, 24)
m <- which.max(abs(cor(X[, j], y)))

newX <- as.data.frame(cbind(y, X[, 1], X[, 2], X[, m + 2]))
anyNA(newX[, 2])
mod <- lm(V1 ~ ., data = newX)
summary(mod)
Call:
lm(formula = V1 ~ ., data = newX)
Residuals:
Min 1Q Median 3Q Max
-1.42575 -0.90957 0.06547 0.38879 2.39707
Coefficients: (1 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.1235 0.2421 4.641 0.000126 ***
V2 NA NA NA NA
V3 0.6803 0.2775 2.452 0.022612 *
V4 -0.5943 0.3036 -1.957 0.063101 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.085 on 22 degrees of freedom
Multiple R-squared: 0.3958, Adjusted R-squared: 0.3408
F-statistic: 7.205 on 2 and 22 DF, p-value: 0.003919
Everything runs fine, but as you can see the estimated coefficient for V2 (which corresponds to the variable X1, the column made up entirely of 1s) is NA.
I don't understand why I get NA as a result: the variable is numeric and has no missing values.
If anyone can help me to understand, thanks!
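For what it's worth, one way to see what is happening here: lm() adds its own intercept, so a column that is constant and equal to 1 is perfectly collinear with that intercept and gets aliased, hence the "1 not defined because of singularities" note. A minimal check, reusing the mod and newX objects from above:
## V2 is constant, so it is perfectly collinear with the automatic intercept
alias(mod)   # reports V2 as aliased with (Intercept)
## dropping the automatic intercept lets the constant column play the intercept's role
mod2 <- lm(V1 ~ . - 1, data = newX)
summary(mod2)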
Related
For Y = the percentage of the population with income below the poverty level and X = per capita income of the population, I have constructed a Box-Cox plot and found lambda = 0.02020:
bc <- boxcox(lm(Percent_below_poverty_level ~ Per_capita_income, data=tidy.CDI), plotit=T)
bc$x[which.max(bc$y)] # gives lambda
Now I want to fit a simple linear regression using the transformed data, so I've entered this code:
transform <- lm((Percent_below_poverty_level**0.02020) ~ (Per_capita_income**0.02020))
transform
But all I get is the error message
'Error in terms.formula(formula, data = data) : invalid power in formula'. What is my mistake?
You could use bcPower() from the car package.
## make sure you do install.packages("car") if you haven't already
library(car)
data(Prestige)
p <- powerTransform(prestige ~ income + education + type ,
data=Prestige,
family="bcPower")
summary(p)
# bcPower Transformation to Normality
# Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
# Y1 1.3052 1 0.9408 1.6696
#
# Likelihood ratio test that transformation parameter is equal to 0
# (log transformation)
# LRT df pval
# LR test, lambda = (0) 41.67724 1 1.0765e-10
#
# Likelihood ratio test that no transformation is needed
# LRT df pval
# LR test, lambda = (1) 2.623915 1 0.10526
mod <- lm(bcPower(prestige, 1.3052) ~ income + education + type, data=Prestige)
summary(mod)
#
# Call:
# lm(formula = bcPower(prestige, 1.3052) ~ income + education +
# type, data = Prestige)
#
# Residuals:
# Min 1Q Median 3Q Max
# -44.843 -13.102 0.287 15.073 62.889
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -3.736e+01 1.639e+01 -2.279 0.0250 *
# income 3.363e-03 6.928e-04 4.854 4.87e-06 ***
# education 1.205e+01 2.009e+00 5.999 3.78e-08 ***
# typeprof 2.027e+01 1.213e+01 1.672 0.0979 .
# typewc -1.078e+01 7.884e+00 -1.368 0.1746
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 22.25 on 93 degrees of freedom
# (4 observations deleted due to missingness)
# Multiple R-squared: 0.8492, Adjusted R-squared: 0.8427
# F-statistic: 131 on 4 and 93 DF, p-value: < 2.2e-16
Powers (more often written ^ than ** in R, FWIW) have a special meaning inside formulas: they denote crossing of terms (interactions) rather than the arithmetic operation. So if you did want to power-transform both sides of your equation, you would wrap each side in I(), the "as-is" operator:
I(Percent_below_poverty_level^0.02020) ~ I(Per_capita_income^0.02020)
However, I think you should do what @DaveArmstrong suggested anyway:
it's only the response variable that gets transformed
the Box-Cox transformation is actually (y^lambda - 1)/lambda (although the shift and scale might not matter for your results)
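Putting those two points together, a minimal sketch of the corrected fit, assuming your tidy.CDI data frame and the lambda found with boxcox() above (pbpl_bc is just a new column name used for illustration):
lambda <- 0.02020  # lambda found from boxcox() above
tidy.CDI$pbpl_bc <- (tidy.CDI$Percent_below_poverty_level^lambda - 1) / lambda
transform <- lm(pbpl_bc ~ Per_capita_income, data = tidy.CDI)
summary(transform)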
I am analysing whether the effects of x_t on y_t differ during and after a specific time period.
I am trying to regress the following model in R using lm():
$y_t = b_0 + [b_1(1 - D_t) + b_2 D_t] x_t$
where D_t is a dummy variable with the value 1 over the time period and 0 otherwise.
Is it possible to use lm() for this formula?
observationNumber <- 1:80
obsFactor <- cut(observationNumber, breaks = c(0,55,81), right =F)
fit <- lm(y ~ x * obsFactor)
For example:
y = runif(80)
x = rnorm(80) + c(rep(0,54), rep(1, 26))
fit <- lm(y ~ x * obsFactor)
summary(fit)
Call:
lm(formula = y ~ x * obsFactor)
Residuals:
Min 1Q Median 3Q Max
-0.48375 -0.29655 0.05957 0.22797 0.49617
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.50959 0.04253 11.983 <2e-16 ***
x -0.02492 0.04194 -0.594 0.554
obsFactor[55,81) -0.06357 0.09593 -0.663 0.510
x:obsFactor[55,81) 0.07120 0.07371 0.966 0.337
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3116 on 76 degrees of freedom
Multiple R-squared: 0.01303, Adjusted R-squared: -0.02593
F-statistic: 0.3345 on 3 and 76 DF, p-value: 0.8004
obsFactor[55,81) is zero if observationNumber < 55 and one if it is greater than or equal to 55; its coefficient estimates a shift in the intercept during that period (a term your original model does not include). x:obsFactor[55,81) is the product of the dummy and the variable $x_t$; its coefficient is $b_2 - b_1$, the change in slope between the two periods. The coefficient for $x_t$ is your $b_1$, and the intercept is your $b_0$.
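If you prefer coefficients that map directly onto $b_0$, $b_1$ and $b_2$ as written in the question, a small sketch reusing the objects defined above:
D <- as.numeric(observationNumber >= 55)    # the dummy D_t
fit2 <- lm(y ~ I(x * (1 - D)) + I(x * D))   # coefficients: b_0, b_1, b_2
summary(fit2)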
In order to correct for heteroskedasticity in the error terms, I am running the following weighted least squares regression in R:
#Call:
#lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
#Weighted Residuals:
# Min 1Q Median 3Q Max
#-1.83779 -0.33226 0.02011 0.25135 1.48516
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) -3.939440 0.609991 -6.458 1.62e-09 ***
#q 0.175019 0.070101 2.497 0.013696 *
#q2 0.048790 0.005613 8.693 8.49e-15 ***
#b 0.473891 0.134918 3.512 0.000598 ***
#c 0.119551 0.125430 0.953 0.342167
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.5096 on 140 degrees of freedom
#Multiple R-squared: 0.9639, Adjusted R-squared: 0.9628
#F-statistic: 933.6 on 4 and 140 DF, p-value: < 2.2e-16
Where "weighting" is a variable (function of the variable q) used for weighting the observations. q2 is simply q^2.
Now, to double-check my results, I manually weight my variables by creating new weighted variables :
mydata$a.wls <- mydata$a * mydata$weighting
mydata$q.wls <- mydata$q * mydata$weighting
mydata$q2.wls <- mydata$q2 * mydata$weighting
mydata$b.wls <- mydata$b * mydata$weighting
mydata$c.wls <- mydata$c * mydata$weighting
And I run the following regression, without the weights option and without a constant; since the constant is weighted as well, the column of 1s in the original predictor matrix should now equal the variable weighting:
Call:
lm(formula = a.wls ~ 0 + weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
#Residuals:
# Min 1Q Median 3Q Max
#-2.38404 -0.55784 0.01922 0.49838 2.62911
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#weighting -4.125559 0.579093 -7.124 5.05e-11 ***
#q.wls 0.217722 0.081851 2.660 0.008726 **
#q2.wls 0.045664 0.006229 7.330 1.67e-11 ***
#b.wls 0.466207 0.121429 3.839 0.000186 ***
#c.wls 0.133522 0.112641 1.185 0.237876
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.915 on 140 degrees of freedom
#Multiple R-squared: 0.9823, Adjusted R-squared: 0.9817
#F-statistic: 1556 on 5 and 140 DF, p-value: < 2.2e-16
As you can see, the results are similar but not identical. Am I doing something wrong while manually weighting the variables, or does the option "weights" do something more than simply multiplying the variables by the weighting vector?
Provided you do the manual weighting correctly, you won't see any discrepancy.
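The reason is that the weighted least squares objective is the same as an ordinary least squares objective on data scaled by the square roots of the weights:
$$\sum_i w_i \bigl(y_i - x_i^\top \beta\bigr)^2 \;=\; \sum_i \bigl(\sqrt{w_i}\,y_i - \sqrt{w_i}\,x_i^\top \beta\bigr)^2 .$$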
So the correct way to go is:
X <- model.matrix(~ q + q2 + b + c, mydata) ## non-weighted model matrix (with intercept)
w <- mydata$weighting ## weights
rw <- sqrt(w) ## root weights
y <- mydata$a ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
## remember to drop intercept when using formula
fit_by_wls <- lm(y ~ X - 1, weights = w)
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
That said, it is generally recommended to use lm.fit and lm.wfit when passing in a matrix directly:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
But when using these internal subroutines lm.fit and lm.wfit, all inputs must be complete cases without NA, otherwise the underlying C routine stats:::C_Cdqrls will complain.
If you still want to use the formula interface rather than matrix, you can do the following:
## weight by square root of weights, not weights
mydata$root.weighting <- sqrt(mydata$weighting)
mydata$a.wls <- mydata$a * mydata$root.weighting
mydata$q.wls <- mydata$q * mydata$root.weighting
mydata$q2.wls <- mydata$q2 * mydata$root.weighting
mydata$b.wls <- mydata$b * mydata$root.weighting
mydata$c.wls <- mydata$c * mydata$root.weighting
fit_by_wls <- lm(formula = a ~ q + q2 + b + c, data = mydata, weights = weighting)
fit_by_ols <- lm(formula = a.wls ~ 0 + root.weighting + q.wls + q2.wls + b.wls + c.wls,
data = mydata)
Reproducible Example
Let's use R's built-in dataset trees. Use head(trees) to inspect it; there are no NAs in this dataset. We aim to fit the model:
Height ~ Girth + Volume
with some random weights between 1 and 2:
set.seed(0); w <- runif(nrow(trees), 1, 2)
We fit this model via weighted regression, either by passing weights to lm, or by manually transforming the data and calling lm with no weights:
X <- model.matrix(~ Girth + Volume, trees) ## non-weighted model matrix (with intercept)
rw <- sqrt(w) ## root weights
y <- trees$Height ## non-weighted response
X_tilde <- rw * X ## weighted model matrix (with intercept)
y_tilde <- rw * y ## weighted response
fit_by_wls <- lm(y ~ X - 1, weights = w)
#Call:
#lm(formula = y ~ X - 1, weights = w)
#Coefficients:
#X(Intercept) XGirth XVolume
# 83.2127 -1.8639 0.5843
fit_by_ols <- lm(y_tilde ~ X_tilde - 1)
#Call:
#lm(formula = y_tilde ~ X_tilde - 1)
#Coefficients:
#X_tilde(Intercept) X_tildeGirth X_tildeVolume
# 83.2127 -1.8639 0.5843
So indeed, we see identical results.
Alternatively, we can use lm.fit and lm.wfit:
matfit_by_wls <- lm.wfit(X, y, w)
matfit_by_ols <- lm.fit(X_tilde, y_tilde)
We can check coefficients by:
matfit_by_wls$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
matfit_by_ols$coefficients
#(Intercept) Girth Volume
# 83.2127455 -1.8639351 0.5843191
Again, results are the same.
I have a model formula in the form of
model.all <- lme(Response ~ A + B + C)
I would like to update this model by successively removing a predictor variable from the model, so I would end up with 3 models, specifically:
mod.1 <- lme(Response ~ B + C) ; mod.2 <- lme(Response ~ A + C) ; mod.3 <- lme(Response ~ A + B)
I am thinking of a loop. I am aware of the update function, but I have too many predictor variables to change the code manually.
Any suggestions would be appreciated.
I would use combn on this occasion; see the example below:
Example Data
Response <- runif(100)
A <- runif(100)
B <- runif(100)
C <- runif(100)
Solution
a <- c('A','B','C') #the names of your variables
b <- as.data.frame(combn(a,2)) #two-way combinations of those using combn
#create the formula for each model
my_forms <- sapply(b, function(x) paste('Response ~ ', paste(x,collapse=' + ')))
> my_forms #the formulas that will be used in the model
V1 V2 V3
"Response ~ A + B" "Response ~ A + C" "Response ~ B + C"
#run each model
my_models <- lapply(my_forms, function(x) lm(as.formula(x)))
Output
> summary(my_models[[1]])
Call:
lm(formula = as.formula(x))
Residuals:
Min 1Q Median 3Q Max
-0.48146 -0.20745 -0.00247 0.24263 0.58341
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.32415 0.08232 3.938 0.000155 ***
A 0.25404 0.09890 2.569 0.011733 *
B 0.07955 0.10129 0.785 0.434141
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2828 on 97 degrees of freedom
Multiple R-squared: 0.06507, Adjusted R-squared: 0.04579
F-statistic: 3.375 on 2 and 97 DF, p-value: 0.03827
As you can see, each model is saved as a list element in my_models. I find this quite easy to set up and run.
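If you have many predictors and specifically want the leave-one-out models, the same idea generalizes; a small sketch along those lines (lm is used as in the example above, but the generated formulas could be passed to lme in the same way):
vars <- c('A', 'B', 'C')                      # your predictor names
drop1.sets <- combn(vars, length(vars) - 1)   # each column drops one predictor
drop1.forms <- apply(drop1.sets, 2, function(x)
  paste('Response ~', paste(x, collapse = ' + ')))
drop1.models <- lapply(drop1.forms, function(f) lm(as.formula(f)))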
This piece of code returns the coefficients: intercept, slope1, slope2.
set.seed(1)
n=10
y=rnorm(n)
x1=rnorm(n)
x2=rnorm(n)
lm.ft=function(y,x1,x2)
return(lm(y~x1+x2)$coef)
res <- list()
for (i in 1:n) {
  x1.bar <- x1 - x1[i]
  x2.bar <- x2 - x2[i]
  res[[i]] <- lm.ft(y, x1.bar, x2.bar)
}
If I type:
> res[[1]]
I get:
(Intercept) x1 x2
-0.44803887 0.06398476 -0.62798646
How can we return predicted values, residuals, R squared, etc.?
I need something general to extract whatever I need from the summary.
There are a couple of things going on here.
First, you are better off combining your variables into a data.frame:
df <- data.frame(y=rnorm(10), x1=rnorm(10), x2 = rnorm(10))
fit <- lm(y~x1+x2, data=df)
If you do this, using your model for prediction with a new dataset will be much easier.
Second, some of the statistics of the fit are accessible from the model itself, and some are accessible from summary(fit).
coef <- coefficients(fit) # coefficients
resid <- residuals(fit) # residuals
pred <- predict(fit) # fitted values
rsq <- summary(fit)$r.squared # R-sq for the fit
se <- summary(fit)$sigma # se of the fit
To get the statistics of the coefficients, you need to use summary:
stat.coef <- summary(fit)$coefficients
coef <- stat.coef[,1] # 1st column: coefficients (same as above)
se.coef <- stat.coef[,2] # 2nd column: se for each coef
t.coef <- stat.coef[,3] # 3rd column: t-value for each coef
p.coef <- stat.coef[,4] # 4th column: p-value for each coefficient
In your function, you return just the coefficients. Try returning the whole model:
lm.ft=function(y,x1,x2) lm(y~x1+x2) # You don't need the return statement.
Now try your code, and then run:
summary(res[[1]])
# Call:
# lm(formula = y ~ x1 + x2)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.88518 -0.25311 0.03868 0.43110 0.61753
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.44804 0.32615 -1.374 0.2119
# x1 0.06398 0.24048 0.266 0.7979
# x2 -0.62799 0.26915 -2.333 0.0524 .
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.6149 on 7 degrees of freedom
# Multiple R-squared: 0.5173, Adjusted R-squared: 0.3794
# F-statistic: 3.751 on 2 and 7 DF, p-value: 0.07814
You need predict:
set.seed(1)
n=10
y=rnorm(n)
x1=rnorm(n)
x2=rnorm(n)
lm.ft=function(y,x1,x2)
# return(lm(y~x1+x2)$coef)
return(lm(y~x1+x2))
res=lm.ft(y,x1,x2)
ypredicted <- predict(res)
residuals <- y - ypredicted
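Equivalently, the extractor functions on the fitted model give the same quantities:
ypred2 <- fitted(res)      # same as predict(res) on the original data
resid2 <- residuals(res)   # same as y - ypredicted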