I have some data:
> xt2
# A tibble: 5 x 3
# Groups: symbol [1]
symbol wavgd1 rowNo
<chr> <dbl> <int>
1 REGI 4.84 2220
2 REGI 0.493 2221
3 REGI -0.0890 2222
4 REGI 0.190 2223
5 REGI -1.93 2224
which, when I process it with lm():
xt2t = lm( formula=wavgd1~rowNo, data=as.data.frame(xt2) )
gives the expected result (fitted.values[5] is the test here):
> summary(xt2t)
Call:
lm(formula = wavgd1 ~ rowNo, data = as.data.frame(xt2))
Residuals:
1 2 3 4 5
1.3723 -1.5937 -0.7907 0.8733 0.1388
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3078.1707 979.5475 3.142 0.0516 .
rowNo -1.3850 0.4408 -3.142 0.0516 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.394 on 3 degrees of freedom
Multiple R-squared: 0.7669, Adjusted R-squared: 0.6892
F-statistic: 9.87 on 1 and 3 DF, p-value: 0.05159
But when I process it using rollapply:
xl = zoo::rollapply(xt2,
                    width = 5,
                    FUN = function(Z)
                    {
                      print( as.data.frame(Z) )
                      t = lm( formula = wavgd1 ~ rowNo, data = as.data.frame(Z) )
                      print( summary(t) )
                      return( t$fitted.values[[5]] )
                    },
                    by.column = FALSE,
                    align = "right",
                    fill = NA)
it just returns the input data:
[1] NA NA NA NA -1.929501
Call:
lm(formula = wavgd1 ~ rowNo, data = as.data.frame(Z))
Residuals:
ALL 5 residuals are 0: no residual degrees of freedom!
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.844 NA NA NA
rowNo2221 -4.351 NA NA NA
rowNo2222 -4.933 NA NA NA
rowNo2223 -4.654 NA NA NA
rowNo2224 -6.773 NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 4 and 0 DF, p-value: NA
In the rollapply() case it looks like each value of rowNo is being treated as its own factor level (note the rowNo2221, rowNo2222, ... coefficients), so each row is processed as an individual case rather than en masse?
Not sure why rollapply() is being petulant here, but I switched to using slider:: and all is well:
f5 = function(x) {
  r = lm( formula = wavgd1 ~ rowNo, data = x )
  return( r$fitted.values[[length(r$fitted.values)]] )
}
mutate( xt2, rval = slide_dbl(xt2, ~f5(.x), .before = 4, .complete = TRUE) )
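The likely cause: with by.column = FALSE, zoo::rollapply() hands each window to FUN as a matrix, and because symbol is a character column the whole matrix is coerced to character; as.data.frame(Z) then turns rowNo into a factor, so lm() fits one dummy per row (hence the rowNo2221, rowNo2222, ... coefficients and zero residual degrees of freedom). A minimal sketch of a rollapply()-based fix, assuming the character column can simply be dropped before rolling:
# Keep only the numeric columns so the window matrix stays numeric
xl = zoo::rollapply(xt2[, c("wavgd1", "rowNo")],
                    width = 5,
                    FUN = function(Z) {
                      d = as.data.frame(Z)            # rowNo stays numeric now
                      t = lm(wavgd1 ~ rowNo, data = d)
                      tail(fitted(t), 1)              # last fitted value of the window
                    },
                    by.column = FALSE,
                    align = "right",
                    fill = NA)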
Related
I have 2 dataframes
#dummy df for examples:
set.seed(1)
df1 <- data.frame(t = (1:16),
A = sample(20, 16),
B = sample(30, 16),
C = sample(30, 16))
df2 <- data.frame(t = (1:16),
A = sample(20, 16),
B = sample(30, 16),
C = sample(30, 16))
I want to do this for every column in both dataframes (except the t column):
model <- lm(df2$A ~ df1$A, data = NULL)
I have tried something like this:
model <- function(yvar, xvar){
  lm(df1$as.name(yvar) ~ df2$as.name(xvar), data = NULL)
}
lapply(names(data), model)
but it obviously doesn't work. What am I doing wrong?
In the end, what I really want is to get the coefficients and other statistics from the models. What is stopping me is how to run a linear model with variables from different data frames multiple times.
The output I want should, I guess, look something like this:
# [[1]]
# Call:
# lm(df1$as.name(yvar) ~ df2$as.name(xvar), data = NULL)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.8809 -0.2318 0.1657 0.3787 0.5533
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.013981 0.169805 -0.082 0.936
# predmodex[, 2] 1.000143 0.002357 424.351 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.4584 on 14 degrees of freedom
# Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
# F-statistic: 1.801e+05 on 1 and 14 DF, p-value: < 2.2e-16
#
# [[2]]
# Call:
# lm(df1$as.name(yvar) ~ df2$as.name(xvar), data = NULL)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.8809 -0.2318 0.1657 0.3787 0.5533
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.013981 0.169805 -0.082 0.936
# predmodex[, 2] 1.000143 0.002357 424.351 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.4584 on 14 degrees of freedom
# Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
# F-statistic: 1.801e+05 on 1 and 14 DF, p-value: < 2.2e-16
#
# [[3]]
# Call:
# lm(df1$as.name(yvar) ~ df2$as.name(xvar), data = NULL)
#
# Residuals:
# Min 1Q Median 3Q Max
# -0.8809 -0.2318 0.1657 0.3787 0.5533
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -0.013981 0.169805 -0.082 0.936
# predmodex[, 2] 1.000143 0.002357 424.351 <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.4584 on 14 degrees of freedom
# Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
# F-statistic: 1.801e+05 on 1 and 14 DF, p-value: < 2.2e-16
Since df1 and df2 have the same column names, you can do this:
model <- function(var){
  lm(df1[[var]] ~ df2[[var]])
}
result <- lapply(names(df1)[-1], model)
result
#[[1]]
#Call:
#lm(formula = df1[[var]] ~ df2[[var]])
#Coefficients:
#(Intercept) df2[[var]]
# 15.1504 -0.4763
#[[2]]
#Call:
#lm(formula = df1[[var]] ~ df2[[var]])
#Coefficients:
#(Intercept) df2[[var]]
# 3.0227 0.6374
#[[3]]
#Call:
#lm(formula = df1[[var]] ~ df2[[var]])
#Coefficients:
#(Intercept) df2[[var]]
# 15.4240 0.2411
To get summary statistics from the model you can use broom::tidy :
purrr::map_df(result, broom::tidy, .id = 'model_num')
# model_num term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 1 (Intercept) 15.2 3.03 5.00 0.000194
#2 1 df2[[var]] -0.476 0.248 -1.92 0.0754
#3 2 (Intercept) 3.02 4.09 0.739 0.472
#4 2 df2[[var]] 0.637 0.227 2.81 0.0139
#5 3 (Intercept) 15.4 4.40 3.50 0.00351
#6 3 df2[[var]] 0.241 0.272 0.888 0.390
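For model-level statistics (R-squared, F statistic, etc.) rather than per-coefficient ones, broom::glance can be mapped the same way; a sketch:
# One row of fit statistics (r.squared, statistic, p.value, ...) per model
purrr::map_df(result, broom::glance, .id = 'model_num')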
Recently, I stumbled upon the fact that Stata and R handle regressions without intercept differently. I'm not a statistician, so please be kind if my vocabulary is not ideal.
I tried to make the example somewhat reproducible. This is my example in R:
> set.seed(20210211)
> df <- data.frame(y = runif(50), x = runif(50))
> df$d <- df$x > 0.5
>
> (tmp <- tempfile("data", fileext = ".csv"))
[1] "C:\\Users\\s1504gl\\AppData\\Local\\Temp\\1\\RtmpYtS6uk\\data1b2c1c4a96.csv"
> write.csv(df, tmp, row.names = FALSE)
>
> summary(lm(y ~ x + d, data = df))
Call:
lm(formula = y ~ x + d, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.48651 -0.27449 0.03828 0.22119 0.53347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4375 0.1038 4.214 0.000113 ***
x -0.1026 0.3168 -0.324 0.747521
dTRUE 0.1513 0.1787 0.847 0.401353
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2997 on 47 degrees of freedom
Multiple R-squared: 0.03103, Adjusted R-squared: -0.0102
F-statistic: 0.7526 on 2 and 47 DF, p-value: 0.4767
> summary(lm(y ~ x + d + 0, data = df))
Call:
lm(formula = y ~ x + d + 0, data = df)
Residuals:
Min 1Q Median 3Q Max
-0.48651 -0.27449 0.03828 0.22119 0.53347
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x -0.1026 0.3168 -0.324 0.747521
dFALSE 0.4375 0.1038 4.214 0.000113 ***
dTRUE 0.5888 0.2482 2.372 0.021813 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2997 on 47 degrees of freedom
Multiple R-squared: 0.7196, Adjusted R-squared: 0.7017
F-statistic: 40.21 on 3 and 47 DF, p-value: 4.996e-13
And here is what I have in Stata (please note that I have copied the filename from R to Stata):
. import delimited "C:\Users\s1504gl\AppData\Local\Temp\1\RtmpYtS6uk\data1b2c1c4a96.csv"
(3 vars, 50 obs)
. encode d, generate(d_enc)
.
. regress y x i.d_enc
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(2, 47) = 0.75
Model | .135181652 2 .067590826 Prob > F = 0.4767
Residual | 4.22088995 47 .089806169 R-squared = 0.0310
-------------+---------------------------------- Adj R-squared = -0.0102
Total | 4.3560716 49 .08889942 Root MSE = .29968
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | -.1025954 .3168411 -0.32 0.748 -.7399975 .5348067
|
d_enc |
TRUE | .1512977 .1786527 0.85 0.401 -.2081052 .5107007
_cons | .4375371 .103837 4.21 0.000 .2286441 .6464301
------------------------------------------------------------------------------
. regress y x i.d_enc, noconstant
Source | SS df MS Number of obs = 50
-------------+---------------------------------- F(2, 48) = 38.13
Model | 9.23913703 2 4.61956852 Prob > F = 0.0000
Residual | 5.81541777 48 .121154537 R-squared = 0.6137
-------------+---------------------------------- Adj R-squared = 0.5976
Total | 15.0545548 50 .301091096 Root MSE = .34807
------------------------------------------------------------------------------
y | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
x | .976214 .2167973 4.50 0.000 .5403139 1.412114
|
d_enc |
TRUE | -.2322011 .1785587 -1.30 0.200 -.5912174 .1268151
------------------------------------------------------------------------------
As you can see, the results of the regression with intercept are identical. But if I omit the intercept (+ 0 in R, , noconstant in Stata), the results differ. In R, the intercept is now captured in dFALSE, which is reasonable from what I understand. I don't understand what Stata is doing here. Also the degrees of freedom differ.
My questions:
Can anyone explain to me how Stata is handling this?
How can I replicate Stata's behavior in R?
I believe bas pointed in the right direction, but I am still unsure why both results differ.
I am not attempting to answer the question, but to provide a deeper understanding of what Stata is doing, by digging into the source of R's lm() function. In the following lines I replicate what lm() does, but skipping over sanity checks and options such as weights, contrasts, etc.
(I cannot yet fully understand why, in the second regression (with NO CONSTANT), the dFALSE coefficient captures the effect of the intercept from the default regression (with constant). Presumably it is because, once the intercept is dropped, R expands the first factor to full dummy coding, so dFALSE and dTRUE together span the constant.)
set.seed(20210211)
df <- data.frame(y = runif(50), x = runif(50))
df$d <- df$x > 0.5
lm() With Constant
form_default <- as.formula(y ~ x + d)
mod_frame_def <- model.frame(form_default, df)
mod_matrix_def <- model.matrix(object = attr(mod_frame_def, "terms"), mod_frame_def)
head(mod_matrix_def)
#> (Intercept) x dTRUE
#> 1 1 0.7861162 1
#> 2 1 0.2059603 0
#> 3 1 0.9793946 1
#> 4 1 0.8569093 1
#> 5 1 0.8124811 1
#> 6 1 0.7769280 1
stats:::lm.fit(
  y = model.response(mod_frame_def),
  x = mod_matrix_def
)$coefficients
#> (Intercept) x dTRUE
#> 0.4375371 -0.1025954 0.1512977
lm() No Constant
form_nocon <- as.formula(y ~ x + d + 0)
mod_frame_nocon <- model.frame(form_nocon, df)
mod_matrix_nocon <- model.matrix(object = attr(mod_frame_nocon, "terms"), mod_frame_nocon)
head(mod_matrix_nocon)
#> x dFALSE dTRUE
#> 1 0.7861162 0 1
#> 2 0.2059603 1 0
#> 3 0.9793946 0 1
#> 4 0.8569093 0 1
#> 5 0.8124811 0 1
#> 6 0.7769280 0 1
stats:::lm.fit(
  y = model.response(mod_frame_nocon),
  x = mod_matrix_nocon
)$coefficients
#> x dFALSE dTRUE
#> -0.1025954 0.4375371 0.5888348
lm() with as.numeric()
[as indicated in the comments by bas]
form_asnum <- as.formula(y ~ x + as.numeric(d) + 0)
mod_frame_asnum <- model.frame(form_asnum, df)
mod_matrix_asnum <- model.matrix(object = attr(mod_frame_asnum, "terms"), mod_frame_asnum)
head(mod_matrix_asnum)
#> x as.numeric(d)
#> 1 0.7861162 1
#> 2 0.2059603 0
#> 3 0.9793946 1
#> 4 0.8569093 1
#> 5 0.8124811 1
#> 6 0.7769280 1
stats:::lm.fit(
  y = model.response(mod_frame_asnum),
  x = mod_matrix_asnum
)$coefficients
#> x as.numeric(d)
#> 0.9762140 -0.2322012
Created on 2021-03-18 by the reprex package (v1.0.0)
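For completeness, the same Stata-matching fit can be had from a plain lm() call, without dropping down to lm.fit(); a sketch:
# Treating d as numeric reproduces Stata's noconstant coefficients
# (x = 0.9762140, d = -0.2322012, matching the output above)
summary(lm(y ~ x + as.numeric(d) + 0, data = df))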
Using R...
I have a data.frame with five variables.
One of the variables, colr, is an integer with values 1, 2, 3, 4, and 5.
Problem: I would like to build a regression model where the values within colr are reported as independent variables with the following names.
1 = Silver,
2 = Blue,
3 = Pink,
4 = Other than Silver, Blue or Pink,
5 = Color Not Reported.
Question: Is there a way to extract or rename these values that is different from the following (this approach does not rename, e.g., 1 to Silver in the regression summary output):
lm(dependent variable ~ + I(colr.f == 1) +
I(colr.f == 2) +
I(colr.f == 3) +
I(colr.f == 4) +
I(colr.f == 5),
data = df)
I am open to any method that would allow me to create and name these different values independently but would prefer to see if there is a way to do so using the tidyverse or dplyr as this is something I have to do frequently when building multivariate models.
Thank you for any help.
If you have this:
df <- data.frame(int = sample(5, 20, TRUE), value = rnorm(20))
df
#> int value
#> 1 3 -0.62042198
#> 2 4 0.85009260
#> 3 5 -1.04971518
#> 4 1 -2.58255471
#> 5 1 0.62357772
#> 6 4 0.00286785
#> 7 4 -0.05981318
#> 8 4 0.72961261
#> 9 4 -0.03156315
#> 10 1 -2.05486209
#> 11 5 1.77099554
#> 12 1 1.02790956
#> 13 1 -0.70354012
#> 14 1 0.27353731
#> 15 2 -0.04817215
#> 16 2 0.17151374
#> 17 5 -0.54824346
#> 18 2 0.41123284
#> 19 5 0.05466070
#> 20 1 -0.41029986
You can do this:
library(tidyverse)
df <- df %>% mutate(color = factor(c("red", "green", "orange", "blue", "pink"))[int])
df
#> int value color
#> 1 3 -0.62042198 orange
#> 2 4 0.85009260 blue
#> 3 5 -1.04971518 pink
#> 4 1 -2.58255471 red
#> 5 1 0.62357772 red
#> 6 4 0.00286785 blue
#> 7 4 -0.05981318 blue
#> 8 4 0.72961261 blue
#> 9 4 -0.03156315 blue
#> 10 1 -2.05486209 red
#> 11 5 1.77099554 pink
#> 12 1 1.02790956 red
#> 13 1 -0.70354012 red
#> 14 1 0.27353731 red
#> 15 2 -0.04817215 green
#> 16 2 0.17151374 green
#> 17 5 -0.54824346 pink
#> 18 2 0.41123284 green
#> 19 5 0.05466070 pink
#> 20 1 -0.41029986 red
Which allows a regression like this:
lm(value ~ color, data = df) %>% summary()
#>
#> Call:
#> lm(formula = value ~ color, data = df)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.03595 -0.33687 -0.00447 0.46149 1.71407
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 0.2982 0.4681 0.637 0.534
#> colorgreen -0.1200 0.7644 -0.157 0.877
#> colororange -0.9187 1.1466 -0.801 0.436
#> colorpink -0.2413 0.7021 -0.344 0.736
#> colorred -0.8448 0.6129 -1.378 0.188
#>
#> Residual standard error: 1.047 on 15 degrees of freedom
#> Multiple R-squared: 0.1451, Adjusted R-squared: -0.0829
#> F-statistic: 0.6364 on 4 and 15 DF, p-value: 0.6444
Created on 2020-02-16 by the reprex package (v0.3.0)
I'm not sure I'm understanding your question the right way, but can't you just use
library(dplyr)
df <- df %>%
  mutate(color = factor(colr.f, levels = 1:5,
                        labels = c("silver", "blue", "pink", "not s, b, p", "not reported")))
and then just run the regression on color only.
/edit for clarification. Making up some data:
df <- data.frame(
  x = rnorm(100),
  color = factor(rep(c(1, 2, 3, 4, 5), each = 20),
                 labels = c("Silver", "Blue", "Pink", "Not S, B, P", "Not reported")),
  y = rnorm(100, 4))
m1 <- lm(y~x+color, data=df)
m2 <- lm(y~x+color-1, data=df)
summary(m1)
Call:
lm(formula = y ~ x + color, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.96394 -0.59647 0.00237 0.56916 2.13392
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.93238 0.19312 20.362 <2e-16 ***
x 0.13588 0.09856 1.379 0.171
colorBlue -0.07862 0.27705 -0.284 0.777
colorPink -0.02167 0.27393 -0.079 0.937
colorNot S, B, P 0.15238 0.27221 0.560 0.577
colorNot reported 0.14139 0.27230 0.519 0.605
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8606 on 94 degrees of freedom
Multiple R-squared: 0.0268, Adjusted R-squared: -0.02496
F-statistic: 0.5177 on 5 and 94 DF, p-value: 0.7623
summary(m2)
Call:
lm(formula = y ~ x + color - 1, data = df)
Residuals:
Min 1Q Median 3Q Max
-1.96394 -0.59647 0.00237 0.56916 2.13392
Coefficients:
Estimate Std. Error t value Pr(>|t|)
x 0.13588 0.09856 1.379 0.171
colorSilver 3.93238 0.19312 20.362 <2e-16 ***
colorBlue 3.85376 0.19570 19.692 <2e-16 ***
colorPink 3.91071 0.19301 20.262 <2e-16 ***
colorNot S, B, P 4.08477 0.19375 21.083 <2e-16 ***
colorNot reported 4.07377 0.19256 21.156 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8606 on 94 degrees of freedom
Multiple R-squared: 0.9578, Adjusted R-squared: 0.9551
F-statistic: 355.5 on 6 and 94 DF, p-value: < 2.2e-16
The first model is a model with intercept, therefore one of the factor levels must be dropped to avoid perfect multicollinearity. In this case, the "effect" of silver is the value of the intercept, while the "effect" of the other colors is the intercept coefficient value + their respective coefficient value.
The second model is estimated without intercept (without constant), so you can see the individual effects. However, you should probably know what you are doing before estimating the model without intercept.
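A quick numeric check of that relationship, using m1 and m2 from above:
# Intercept of m1 plus a color's m1 coefficient equals that color's m2 coefficient
coef(m1)[["(Intercept)"]] + coef(m1)[["colorBlue"]]  # 3.85376
coef(m2)[["colorBlue"]]                              # 3.85376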
With base R:
labels <- c("Silver", "Blue", "Pink", "Other Color", "Color Not Reported")
df$colr.f2 <- factor(df$colr.f, levels = seq_along(labels), labels = labels)
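Then, assuming the response column is called y (a hypothetical name, since the question doesn't give one), the regression output reports the labels directly:
# Coefficients show up as colr.f2Blue, colr.f2Pink, and so on
summary(lm(y ~ colr.f2, data = df))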
I'm new to R and not sure how to fix the error I'm getting.
Here is the summary of my data:
> summary(data)
Metro MrktRgn MedAge numHmSales
Abilene : 1 Austin-Waco-Hill Country : 6 20-25: 3 Min. : 302
Amarillo : 1 Far West Texas : 1 25-30: 6 1st Qu.: 1057
Arlington: 1 Gulf Coast - Brazos Bottom:10 30-35:28 Median : 2098
Austin : 1 Northeast Texas :14 35-40: 6 Mean : 7278
Bay Area : 1 Panhandle and South Plains: 5 45-50: 2 3rd Qu.: 5086
Beaumont : 1 South Texas : 7 50-55: 1 Max. :83174
(Other) :40 West Texas : 3
AvgSlPr totNumLs MedHHInc Pop
Min. :123833 Min. : 1257 Min. :37300 Min. : 2899
1st Qu.:149117 1st Qu.: 6028 1st Qu.:53100 1st Qu.: 56876
Median :171667 Median : 11106 Median :57000 Median : 126482
Mean :188637 Mean : 24302 Mean :60478 Mean : 296529
3rd Qu.:215175 3rd Qu.: 25472 3rd Qu.:66200 3rd Qu.: 299321
Max. :303475 Max. :224230 Max. :99205 Max. :2196000
NA's :1
Then I make a model with AvgSlPr as the y variable and the other variables as x variables:
> model1 = lm(AvgSlPr ~ Metro + MrktRgn + MedAge + numHmSales + totNumLs + MedHHInc + Pop)
but when I do a summary of the model, I get NA for the Std. Error, t value, and p-value columns.
> summary(model1)
Call:
lm(formula = AvgSlPr ~ Metro + MrktRgn + MedAge + numHmSales +
totNumLs + MedHHInc + Pop)
Residuals:
ALL 45 residuals are 0: no residual degrees of freedom!
Coefficients: (15 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 143175 NA NA NA
MetroAmarillo 24925 NA NA NA
MetroArlington 35258 NA NA NA
MetroAustin 160300 NA NA NA
MetroBay Area 68642 NA NA NA
MetroBeaumont 5942 NA NA NA
...
MrktRgnWest Texas NA NA NA NA
MedAge25-30 NA NA NA NA
MedAge30-35 NA NA NA NA
MedAge35-40 NA NA NA NA
MedAge45-50 NA NA NA NA
MedAge50-55 NA NA NA NA
numHmSales NA NA NA NA
totNumLs NA NA NA NA
MedHHInc NA NA NA NA
Pop NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 44 and 0 DF, p-value: NA
Does anyone know what's going wrong and how I can fix it? Also, I'm not supposed to be using dummy variables.
Each level of your Metro variable occurs in only a single row. You need at least two points to fit a line. Let me demonstrate with an example:
dat = data.frame(AvgSlPr=runif(4), Metro = factor(LETTERS[1:4]), MrktRgn = runif(4))
model1 = lm(AvgSlPr ~ Metro + MrktRgn, data = dat)
summary(model1)
#Call:
#lm(formula = AvgSlPr ~ Metro + MrktRgn, data = dat)
#Residuals:
#ALL 4 residuals are 0: no residual degrees of freedom!
#Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.33801 NA NA NA
#MetroB 0.47350 NA NA NA
#MetroC -0.04118 NA NA NA
#MetroD 0.20047 NA NA NA
#MrktRgn NA NA NA NA
#Residual standard error: NaN on 0 degrees of freedom
#Multiple R-squared: 1, Adjusted R-squared: NaN
#F-statistic: NaN on 3 and 0 DF, p-value: NA
But if we add more data so that at least some of the factor levels have more than one row of data, the linear model can be calculated:
dat = rbind(dat, data.frame(AvgSlPr=2:4, Metro=factor(LETTERS[2:4]), MrktRgn = 3:5))
model2 = lm(AvgSlPr ~ Metro + MrktRgn, data=dat)
summary(model2)
#Call:
#lm(formula = AvgSlPr ~ Metro + MrktRgn, data = dat)
#Residuals:
# 1 2 3 4 5 6 7
# 9.021e-17 2.643e-01 7.304e-03 -1.498e-01 -2.643e-01 -7.304e-03 1.498e-01
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.24279 0.30406 0.798 0.50834
#MetroB -0.10207 0.38858 -0.263 0.81739
#MetroC -0.06696 0.39471 -0.170 0.88090
#MetroD 0.06804 0.41243 0.165 0.88413
#MrktRgn 0.70787 0.06747 10.491 0.00896 **
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.3039 on 2 degrees of freedom
#Multiple R-squared: 0.9857, Adjusted R-squared: 0.9571
#F-statistic: 34.45 on 4 and 2 DF, p-value: 0.02841
The data used to fit the model need to be rethought. What is the goal of the analysis? What data are needed to achieve it?
I'm trying to run an uncomplicated regression in R and am receiving a long list of coefficient values with NAs for the standard errors and t-values. I've never experienced this before.
Result:
summary(model)
Call:
lm(formula = fed$SPX.Index ~ fed$Fed.Treasuries...MM., data = fed)
Residuals:
ALL 311 residuals are 0: no residual degrees of freedom!
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1258.84 NA NA NA
fed$Fed.Treasuries...MM. 1,016,102 0.94 NA NA NA
fed$Fed.Treasuries...MM. 1,030,985 17.72 NA NA NA
fed$Fed.Treasuries...MM. 1,062,061 27.12 NA NA NA
fed$Fed.Treasuries...MM. 917,451 -52.77 NA NA NA
fed$Fed.Treasuries...MM. 949,612 -30.56 NA NA NA
fed$Fed.Treasuries...MM. 967,553 -23.61 NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 310 and 0 DF, p-value: NA
head(fed)
X Fed.Treasuries...MM. Reserve.Repurchases Agency.Debt.Held Treasuries.Maturing.in.5.10.years SPX.Index
1 10/1/2008 476,621 93,063 14,500 93,362 1161.06
2 10/8/2008 476,579 77,349 14,105 93,353 984.94
3 10/15/2008 476,555 107,819 14,105 94,336 907.84
4 10/22/2008 476,512 95,987 14,105 94,327 896.78
5 10/29/2008 476,469 94,655 13,620 94,317 930.09
6 11/5/2008 476,456 96,663 13,235 94,312 952.77
You have commas in the numbers in your CSV file, so R reads those columns as character (and lm() treats them as factors). Your model then has as many levels as rows, and so is degenerate.
Illustration. Take this CSV file:
1, "1,234", "2,345,565"
2, "2,345", "3,234,543"
3, "3,234", "3,987,766"
Read in, fit first column (numbers) against third column (comma-separated numbers):
> fed = read.csv("commas.csv",head=FALSE)
> summary(lm(V1~V3, fed))
Call:
lm(formula = V1 ~ V3, data = fed)
Residuals:
ALL 3 residuals are 0: no residual degrees of freedom!
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1 NA NA NA
V3 3,234,543 1 NA NA NA
V3 3,987,766 2 NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 2 and 0 DF, p-value: NA
Note this is exactly what you are getting but with different column names. So this almost certainly must be what you have.
Fix. Convert column:
> fed$V3 = as.numeric(gsub(",","", fed$V3))
> summary(lm(V1~V3, fed))
Call:
lm(formula = V1 ~ V3, data = fed)
Residuals:
1 2 3
0.02522 -0.05499 0.02977
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.875e+00 1.890e-01 -9.922 0.0639 .
V3 1.215e-06 5.799e-08 20.952 0.0304 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.06742 on 1 degrees of freedom
Multiple R-squared: 0.9977, Adjusted R-squared: 0.9955
F-statistic: 439 on 1 and 1 DF, p-value: 0.03036
Repeat over columns as necessary.
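A sketch of that repetition in one pass, using the column names from the head(fed) output above (adjust the list as needed):
# Strip commas and convert every affected column to numeric
num_cols <- c("Fed.Treasuries...MM.", "Reserve.Repurchases",
              "Agency.Debt.Held", "Treasuries.Maturing.in.5.10.years")
fed[num_cols] <- lapply(fed[num_cols], function(x) as.numeric(gsub(",", "", x)))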