Summary of model returning NA - r

I'm new to R and not sure how to fix the error I'm getting.
Here is the summary of my data:
> summary(data)
       Metro                         MrktRgn      MedAge     numHmSales
 Abilene  : 1   Austin-Waco-Hill Country  : 6   20-25: 3   Min.   :  302
 Amarillo : 1   Far West Texas            : 1   25-30: 6   1st Qu.: 1057
 Arlington: 1   Gulf Coast - Brazos Bottom:10   30-35:28   Median : 2098
 Austin   : 1   Northeast Texas           :14   35-40: 6   Mean   : 7278
 Bay Area : 1   Panhandle and South Plains: 5   45-50: 2   3rd Qu.: 5086
 Beaumont : 1   South Texas               : 7   50-55: 1   Max.   :83174
 (Other)  :40   West Texas                : 3
    AvgSlPr          totNumLs         MedHHInc          Pop
 Min.   :123833   Min.   :  1257   Min.   :37300   Min.   :   2899
 1st Qu.:149117   1st Qu.:  6028   1st Qu.:53100   1st Qu.:  56876
 Median :171667   Median : 11106   Median :57000   Median : 126482
 Mean   :188637   Mean   : 24302   Mean   :60478   Mean   : 296529
 3rd Qu.:215175   3rd Qu.: 25472   3rd Qu.:66200   3rd Qu.: 299321
 Max.   :303475   Max.   :224230   Max.   :99205   Max.   :2196000
 NA's   :1
Then I make a model with AvgSlPr as the y variable and the other variables as x variables:
> model1 = lm(AvgSlPr ~ Metro + MrktRgn + MedAge + numHmSales + totNumLs + MedHHInc + Pop)
but when I do a summary of the model, I get NA for the Std. Error, t value, and p-values.
> summary(model1)
Call:
lm(formula = AvgSlPr ~ Metro + MrktRgn + MedAge + numHmSales +
totNumLs + MedHHInc + Pop)
Residuals:
ALL 45 residuals are 0: no residual degrees of freedom!
Coefficients: (15 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 143175 NA NA NA
MetroAmarillo 24925 NA NA NA
MetroArlington 35258 NA NA NA
MetroAustin 160300 NA NA NA
MetroBay Area 68642 NA NA NA
MetroBeaumont 5942 NA NA NA
...
MrktRgnWest Texas NA NA NA NA
MedAge25-30 NA NA NA NA
MedAge30-35 NA NA NA NA
MedAge35-40 NA NA NA NA
MedAge45-50 NA NA NA NA
MedAge50-55 NA NA NA NA
numHmSales NA NA NA NA
totNumLs NA NA NA NA
MedHHInc NA NA NA NA
Pop NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 44 and 0 DF, p-value: NA
Does anyone know what's going wrong and how I can fix this? Also, I'm not supposed to be using dummy variables.

Your Metro variable has only a single observation for each factor level, so each Metro coefficient is fit to a single point. You need at least two points to fit a line. Let me demonstrate with an example:
dat = data.frame(AvgSlPr=runif(4), Metro = factor(LETTERS[1:4]), MrktRgn = runif(4))
model1 = lm(AvgSlPr ~ Metro + MrktRgn, data = dat)
summary(model1)
#Call:
#lm(formula = AvgSlPr ~ Metro + MrktRgn, data = dat)
#Residuals:
#ALL 4 residuals are 0: no residual degrees of freedom!
#Coefficients: (1 not defined because of singularities)
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.33801 NA NA NA
#MetroB 0.47350 NA NA NA
#MetroC -0.04118 NA NA NA
#MetroD 0.20047 NA NA NA
#MrktRgn NA NA NA NA
#Residual standard error: NaN on 0 degrees of freedom
#Multiple R-squared: 1, Adjusted R-squared: NaN
#F-statistic: NaN on 3 and 0 DF, p-value: NA
But if we add more data so that at least some of the factor levels have more than one row of data, the linear model can be calculated:
dat = rbind(dat, data.frame(AvgSlPr=2:4, Metro=factor(LETTERS[2:4]), MrktRgn = 3:5))
model2 = lm(AvgSlPr ~ Metro + MrktRgn, data=dat)
summary(model2)
#Call:
#lm(formula = AvgSlPr ~ Metro + MrktRgn, data = dat)
#Residuals:
# 1 2 3 4 5 6 7
# 9.021e-17 2.643e-01 7.304e-03 -1.498e-01 -2.643e-01 -7.304e-03 1.498e-01
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.24279 0.30406 0.798 0.50834
#MetroB -0.10207 0.38858 -0.263 0.81739
#MetroC -0.06696 0.39471 -0.170 0.88090
#MetroD 0.06804 0.41243 0.165 0.88413
#MrktRgn 0.70787 0.06747 10.491 0.00896 **
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 0.3039 on 2 degrees of freedom
#Multiple R-squared: 0.9857, Adjusted R-squared: 0.9571
#F-statistic: 34.45 on 4 and 2 DF, p-value: 0.02841
The data used to fit the model need to be rethought. What is the goal of the analysis? What data are needed to achieve that goal?
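As a quick diagnostic, you can count the observations per factor level before fitting; this sketch uses simulated data standing in for table(data$Metro) on the real data frame:

```r
# Simulated stand-in: one observation per Metro level, as in the question.
set.seed(42)
dat <- data.frame(AvgSlPr = runif(4), Metro = factor(LETTERS[1:4]))
table(dat$Metro)           # every level appears exactly once
fit <- lm(AvgSlPr ~ Metro, data = dat)
df.residual(fit)           # 0: nothing left over to estimate an error variance
```

Any level with a count of 1 consumes a whole parameter without contributing residual degrees of freedom, which is exactly how a perfect (and useless) fit arises.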


clogit output producing NAs but partially show up when I use cbind

Hey Stack Overflowers,
I'm trying to run a series of MLM fixed-effects logistic regressions with R's clogit function. When I add additional covariates to my model, the summary output shows NAs. But when I use the cbind function, some of the missing covariate coefficients show up.
Here's my model 1 equation and output:
> model1 <- clogit(chldwork~lag_aspgrade_binned+age+strata(childid), data=finaletdtlag, method = 'exact')
> summary(model1)
Call:
coxph(formula = Surv(rep(1, 2686L), chldwork) ~ lag_aspgrade_binned +
age + strata(childid), data = finaletdtlag, method = "exact")
n= 2686, number of events= 2287
coef exp(coef) se(coef) z Pr(>|z|)
lag_aspgrade_binnedhigh school 1.04156 2.83363 0.52572 1.981 0.04757 *
lag_aspgrade_binnedno primary 1.31891 3.73935 0.89010 1.482 0.13841
lag_aspgrade_binnedprimary some hs 0.85000 2.33964 0.56244 1.511 0.13072
lag_aspgrade_binnedsome college 1.28607 3.61855 0.41733 3.082 0.00206 **
age -0.39600 0.67301 0.03105 -12.753 < 2e-16 ***
Here's my model 2 equation:
model2 <- clogit(chldwork~lag_aspgrade_binned+age+sex+chldeth+typesite+selfwlth+enroll+strata(childid), data=finaletdtlag, method = 'exact')
summary(model2)
summary(model2)
Here's the summary output:
> summary(model2)
Call:
coxph(formula = Surv(rep(1, 2686L), chldwork) ~ lag_aspgrade_binned +
age + sex + chldeth + typesite + selfwlth + enroll + strata(childid),
data = finaletdtlag, method = "efron")
n= 2675, number of events= 2277
(11 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
lag_aspgrade_binnedhigh school 0.32943 1.39018 0.13933 2.364 0.0181 *
lag_aspgrade_binnedno primary 0.46553 1.59286 0.25154 1.851 0.0642 .
lag_aspgrade_binnedprimary some hs 0.33477 1.39762 0.15728 2.128 0.0333 *
lag_aspgrade_binnedsome college 0.36268 1.43718 0.11792 3.076 0.0021 **
age -0.07638 0.92647 0.01020 -7.486 7.11e-14 ***
sex1 NA NA 0.00000 NA NA
chldeth2 NA NA 0.00000 NA NA
chldeth3 NA NA 0.00000 NA NA
chldeth4 NA NA 0.00000 NA NA
chldeth6 NA NA 0.00000 NA NA
chldeth7 NA NA 0.00000 NA NA
chldeth8 NA NA 0.00000 NA NA
chldeth9 NA NA 0.00000 NA NA
chldeth99 NA NA 0.00000 NA NA
typesite1 NA NA 0.00000 NA NA
selfwlth1 0.04031 1.04113 0.29201 0.138 0.8902
selfwlth2 0.11971 1.12717 0.28736 0.417 0.6770
selfwlth3 0.07928 1.08251 0.29189 0.272 0.7859
selfwlth4 0.05717 1.05884 0.30231 0.189 0.8500
selfwlth5 0.39709 1.48750 0.43653 0.910 0.3630
selfwlth99 NA NA 0.00000 NA NA
enroll1 -0.20443 0.81511 0.08890 -2.300 0.0215 *
enroll88 NA NA 0.00000 NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
But here's what happens when I use the cbind function to show all of my models next to each other. Note that the coefficients sex1 through chldeth99 are not in model 1.
cbind results:
> cbind(coef(model1), (coef(model2)), coef(model3)) #creating side by side list of all model coefficients
[,1] [,2] [,3]
lag_aspgrade_binnedhigh school 1.0415583 0.27198991 0.32827106
lag_aspgrade_binnedno primary 1.3189131 0.37986205 0.46103492
lag_aspgrade_binnedprimary some hs 0.8499958 0.27831739 0.33256493
lag_aspgrade_binnedsome college 1.2860726 0.30089261 0.36214068
age -0.3960015 -0.06233958 -0.07653464
sex1 1.0415583 NA NA
chldeth2 1.3189131 NA NA
chldeth3 0.8499958 NA NA
chldeth4 1.2860726 NA NA
chldeth6 -0.3960015 NA NA
chldeth7 1.0415583 NA NA
chldeth8 1.3189131 NA NA
chldeth9 0.8499958 NA NA
chldeth99 1.2860726 NA NA
typesite1 -0.3960015 NA NA
selfwlth1 1.0415583 0.03245507 0.04424493
selfwlth2 1.3189131 0.09775395 0.12743276
selfwlth3 0.8499958 0.06499650 0.08854499
selfwlth4 1.2860726 0.05038224 0.07092755
selfwlth5 -0.3960015 0.32162830 0.38079232
selfwlth99 1.0415583 NA NA
enroll1 1.3189131 -0.16966609 -0.30366842
enroll88 0.8499958 NA NA
sex1:enroll1 1.2860726 0.27198991 0.24088361
sex1:enroll88 -0.3960015 0.37986205 NA
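(An aside on column [,1] above, not from the original thread: cbind() recycles shorter vectors to the number of rows of the result, so model 1's five coefficients simply repeat down the column. A minimal sketch with made-up numbers:)

```r
# cbind() recycles the shorter vector to match the longer one, which is
# why the five model-1 coefficients repeat down column [,1].
short <- c(1.042, 1.319, 0.850, 1.286, -0.396)  # like coef(model1)
long  <- seq_len(10) / 10                       # like a 10-term coef(model2)
m <- suppressWarnings(cbind(short, long))
m[6:10, "short"]                                # the same five values again
```

So the "extra" model-1 values next to sex1, chldeth2, etc. are recycling artifacts, not real coefficients.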
Much gratitude for any insights. Wishing you all the best as another year wraps up--special shoutout to those currently in school still on the grind.

lm() on zoo::rollapply() gives odd result

I have some data:
> xt2
# A tibble: 5 x 3
# Groups: symbol [1]
symbol wavgd1 rowNo
<chr> <dbl> <int>
1 REGI 4.84 2220
2 REGI 0.493 2221
3 REGI -0.0890 2222
4 REGI 0.190 2223
5 REGI -1.93 2224
which, when I process it with lm():
xt2t = lm( formula=wavgd1~rowNo, data=as.data.frame(xt2) )
gives the expected result (fitted.values[5] is the test here):
> summary(xt2t)
Call:
lm(formula = wavgd1 ~ rowNo, data = as.data.frame(xt2))
Residuals:
1 2 3 4 5
1.3723 -1.5937 -0.7907 0.8733 0.1388
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3078.1707 979.5475 3.142 0.0516 .
rowNo -1.3850 0.4408 -3.142 0.0516 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.394 on 3 degrees of freedom
Multiple R-squared: 0.7669, Adjusted R-squared: 0.6892
F-statistic: 9.87 on 1 and 3 DF, p-value: 0.05159
But when I process it using rollapply:
xl = zoo::rollapply(xt2,
width=5,
FUN = function(Z)
{
print( as.data.frame(Z) )
t = lm( formula=wavgd1~rowNo, data=as.data.frame(Z) )
print( summary(t) )
return( t$fitted.values[[5]] )
},
by.column=FALSE,
align="right",
fill=NA)
it returns me the input data:
[1] NA NA NA NA -1.929501
Call:
lm(formula = wavgd1 ~ rowNo, data = as.data.frame(Z))
Residuals:
ALL 5 residuals are 0: no residual degrees of freedom!
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.844 NA NA NA
rowNo2221 -4.351 NA NA NA
rowNo2222 -4.933 NA NA NA
rowNo2223 -4.654 NA NA NA
rowNo2224 -6.773 NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 4 and 0 DF, p-value: NA
In the rollapply() case it looks like each row is being processed as an individual case rather than en masse?
It turns out rollapply() with by.column=FALSE passes each window to FUN as a matrix, and since symbol is a character column the whole matrix is coerced to character, so lm() treats rowNo as a factor with one level per row (hence the rowNo2221, rowNo2222, ... coefficients). I switched to the slider package and all is well:
library(dplyr)
library(slider)
f5 <- function(x) {
  r <- lm( formula=wavgd1~rowNo, data=x )
  return( r$fitted.values[[length(r$fitted.values)]] )
}
mutate( xt, rval=slide_dbl(xt, ~f5(.x), .before = 4, .complete=TRUE) )
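The coercion can be seen directly; this sketch reconstructs a slice of xt2 and applies as.matrix(), which is effectively what happens to each rolling window:

```r
# A matrix holds a single type: the character column "symbol" forces the
# whole window to character, so lm() then sees rowNo as a factor.
xt2 <- data.frame(symbol = "REGI",
                  wavgd1 = c(4.84, 0.493, -0.089),
                  rowNo  = 2220:2222)
Z <- as.matrix(xt2)
class(Z[, "rowNo"])            # "character", not numeric
```

With every rowNo value its own factor level, the model has one parameter per row, which is why the "fit" reproduces the input data exactly.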

Fitting a linear regression model in R with confounding variables

I have a dataset called datamoth where survival is the response variable and treatment is a variable that can be considered both categorical and quantitative. The dataset looks as follows:
survival <- c(17,22,26,20,11,14,37,26,24,11,11,16,8,5,12,3,5,4,14,8,4,6,3,3,10,13,5,7,3,3)
treatment <- c(3,3,3,3,3,3,6,6,6,6,6,6,9,9,9,9,9,9,12,12,12,12,12,12,21,21,21,21,21,21)
days <- c(3,3,3,3,3,3,6,6,6,6,6,6,9,9,9,9,9,9,12,12,12,12,12,12,21,21,21,21,21,21)
datamoth <- data.frame(survival, treatment, days)
So, I can fit a linear regression model considering treatment as categorical, like this:
lmod<-lm(survival ~ factor(treatment), datamoth)
My question is how to fit a linear regression model with treatment as categorical variable but also considering treatment as a quantitative confounding variable.
I have figured out something like this:
model <- lm(survival ~ factor(treatment) + factor(treatment)*days, data = datamoth)
summary(model)
Call:
lm(formula = survival ~ factor(treatment) + factor(treatment) *
days, data = datamoth)
Residuals:
Min 1Q Median 3Q Max
-9.833 -3.333 -1.167 3.167 16.167
Coefficients: (5 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.333 2.435 7.530 6.96e-08 ***
factor(treatment)6 2.500 3.443 0.726 0.47454
factor(treatment)9 -12.167 3.443 -3.534 0.00162 **
factor(treatment)12 -12.000 3.443 -3.485 0.00183 **
factor(treatment)21 -11.500 3.443 -3.340 0.00263 **
days NA NA NA NA
factor(treatment)6:days NA NA NA NA
factor(treatment)9:days NA NA NA NA
factor(treatment)12:days NA NA NA NA
factor(treatment)21:days NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.964 on 25 degrees of freedom
Multiple R-squared: 0.5869, Adjusted R-squared: 0.5208
F-statistic: 8.879 on 4 and 25 DF, p-value: 0.0001324
But obviously this code is not working because these two variables are collinear.
Does anyone know how to fix it? Any help will be appreciated.
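For what it's worth, the collinearity can be confirmed directly: in this dataset days is an exact copy of treatment, so factor(treatment) already absorbs everything days could explain. A sketch using the data from the question:

```r
survival  <- c(17,22,26,20,11,14,37,26,24,11,11,16,8,5,12,3,5,4,
               14,8,4,6,3,3,10,13,5,7,3,3)
treatment <- rep(c(3, 6, 9, 12, 21), each = 6)
days      <- treatment                 # identical by construction
# days is constant within each treatment level, so it adds no information:
all(tapply(days, treatment, function(x) length(unique(x))) == 1)  # TRUE
```

Once factor(treatment) is in the model, days (and any treatment-by-days interaction) is perfectly collinear with it, which is why lm() marks those terms as NA.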

Keep getting NaNs when carrying out TukeyHSD in R

I have this small data frame that I want to carry out a TukeyHSD test on.
'data.frame': 4 obs. of 4 variables:
 $ Species: Factor w/ 4 levels "Anthoxanthum",..: 1 1 1 1
 $ Harvest: Factor w/ 4 levels "b","c","d","e": 1 2 3 4
 $ Total  : num 0.2449 0.1248 0.0722 0.1025
I perform an analysis of variance with aov:
anthox1 <- aov(Total ~ Harvest, data=anthox)
anthox.tukey <- TukeyHSD(anthox1, "Harvest", conf.level = 0.95)
but when I run the TukeyHSD I get this message:
Warning message:
In qtukey(conf.level, length(means), x$df.residual) : NaNs produced
Can anyone help me fix the problem and also explain why this is happening? I feel like everything is written correctly (code and data), but for some reason it does not work.
Since you have exactly one observation per group, you get a perfect fit:
Total <- c(0.2449, 0.1248, 0.0722, 0.1025)
Harvest <- c("b","c","d","e")
anthox1 <- aov(Total ~ Harvest)
summary.lm(anthox1)
#Call:
# aov(formula = Total ~ Harvest)
#
#Residuals:
# ALL 4 residuals are 0: no residual degrees of freedom!
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2449 NA NA NA
#Harvestc -0.1201 NA NA NA
#Harvestd -0.1727 NA NA NA
#Harveste -0.1424 NA NA NA
#
#Residual standard error: NaN on 0 degrees of freedom
#Multiple R-squared: 1, Adjusted R-squared: NaN
#F-statistic: NaN on 3 and 0 DF, p-value: NA
This means you have no residual degrees of freedom left for a Tukey test (or for any inferential statistics at all).
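A sketch of what a workable design looks like (with a hypothetical, made-up second replicate added): with two or more observations per Harvest level, aov() has residual degrees of freedom and TukeyHSD() no longer produces NaNs:

```r
# Hypothetical replicated data: two observations per Harvest level.
Total   <- c(0.2449, 0.1248, 0.0722, 0.1025,   # original values
             0.2300, 0.1300, 0.0800, 0.1100)   # made-up second replicate
Harvest <- factor(rep(c("b", "c", "d", "e"), times = 2))
fit <- aov(Total ~ Harvest)
df.residual(fit)                                   # 4
tk <- TukeyHSD(fit, "Harvest", conf.level = 0.95)  # runs without NaNs
```

In other words, the fix is in the data collection, not the code: you need replication within each group.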

Regression summary in R returns a bunch of NAs

Trying to run an uncomplicated regression in R and receiving a long list of coefficient values, with NAs for the standard errors and t-values. I've never experienced this before.
Result:
summary(model)
Call:
lm(formula = fed$SPX.Index ~ fed$Fed.Treasuries...MM., data = fed)
Residuals:
ALL 311 residuals are 0: no residual degrees of freedom!
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1258.84 NA NA NA
fed$Fed.Treasuries...MM. 1,016,102 0.94 NA NA NA
fed$Fed.Treasuries...MM. 1,030,985 17.72 NA NA NA
fed$Fed.Treasuries...MM. 1,062,061 27.12 NA NA NA
fed$Fed.Treasuries...MM. 917,451 -52.77 NA NA NA
fed$Fed.Treasuries...MM. 949,612 -30.56 NA NA NA
fed$Fed.Treasuries...MM. 967,553 -23.61 NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 310 and 0 DF, p-value: NA
head(fed)
X Fed.Treasuries...MM. Reserve.Repurchases Agency.Debt.Held Treasuries.Maturing.in.5.10.years SPX.Index
1 10/1/2008 476,621 93,063 14,500 93,362 1161.06
2 10/8/2008 476,579 77,349 14,105 93,353 984.94
3 10/15/2008 476,555 107,819 14,105 94,336 907.84
4 10/22/2008 476,512 95,987 14,105 94,327 896.78
5 10/29/2008 476,469 94,655 13,620 94,317 930.09
6 11/5/2008 476,456 96,663 13,235 94,312 952.77
You have commas in the numbers in your CSV file, so R reads them as characters. Your model then has as many factor levels as rows, and so is degenerate.
Illustration. Take this CSV file:
1, "1,234", "2,345,565"
2, "2,345", "3,234,543"
3, "3,234", "3,987,766"
Read it in, and fit the first column (plain numbers) against the third column (comma-separated numbers):
> fed = read.csv("commas.csv",head=FALSE)
> summary(lm(V1~V3, fed))
Call:
lm(formula = V1 ~ V3, data = fed)
Residuals:
ALL 3 residuals are 0: no residual degrees of freedom!
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1 NA NA NA
V3 3,234,543 1 NA NA NA
V3 3,987,766 2 NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 2 and 0 DF, p-value: NA
Note this is exactly what you are getting, but with different column names. So this is almost certainly what is happening in your case.
Fix. Convert column:
> fed$V3 = as.numeric(gsub(",","", fed$V3))
> summary(lm(V1~V3, fed))
Call:
lm(formula = V1 ~ V3, data = fed)
Residuals:
1 2 3
0.02522 -0.05499 0.02977
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.875e+00 1.890e-01 -9.922 0.0639 .
V3 1.215e-06 5.799e-08 20.952 0.0304 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.06742 on 1 degrees of freedom
Multiple R-squared: 0.9977, Adjusted R-squared: 0.9955
F-statistic: 439 on 1 and 1 DF, p-value: 0.03036
Repeat over columns as necessary.
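To apply the same fix to several comma-formatted columns in one pass, a sketch (with hypothetical column names standing in for your real ones):

```r
# Rebuild the small example with two comma-formatted character columns.
fed <- data.frame(V1 = 1:3,
                  V2 = c("1,234", "2,345", "3,234"),
                  V3 = c("2,345,565", "3,234,543", "3,987,766"))
strip_commas <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))
comma_cols <- c("V2", "V3")            # the affected columns
fed[comma_cols] <- lapply(fed[comma_cols], strip_commas)
sapply(fed, is.numeric)                # all TRUE now
```

Checking sapply(fed, is.numeric) before modeling is a cheap way to catch this class of problem early.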
