I have this small data frame that I want to carry out a TukeyHSD test on.
'data.frame': 4 obs. of 4 variables:
$ Species : Factor w/ 4 levels "Anthoxanthum",..: 1 1 1 1
$ Harvest : Factor w/ 4 levels "b","c","d","e": 1 2 3 4
$ Total : num 0.2449 0.1248 0.0722 0.1025
I perform an analysis of variance with aov:
anthox1 <- aov(Total ~ Harvest, data=anthox)
anthox.tukey <- TukeyHSD(anthox1, "Harvest", conf.level = 0.95)
but when I run the TukeyHSD I get this message:
Warning message:
In qtukey(conf.level, length(means), x$df.residual) : NaNs produced
Can anyone help me fix the problem and explain why this is happening? Everything looks correctly written to me (code and data), but for some reason it does not work.
Since you have exactly one observation per group, you get a perfect fit:
Total <- c(0.2449, 0.1248, 0.0722, 0.1025)
Harvest <- c("b","c","d","e")
anthox1 <- aov(Total ~ Harvest)
summary.lm(anthox1)
#Call:
# aov(formula = Total ~ Harvest)
#
#Residuals:
# ALL 4 residuals are 0: no residual degrees of freedom!
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.2449 NA NA NA
#Harvestc -0.1201 NA NA NA
#Harvestd -0.1727 NA NA NA
#Harveste -0.1424 NA NA NA
#
#Residual standard error: NaN on 0 degrees of freedom
#Multiple R-squared: 1, Adjusted R-squared: NaN
#F-statistic: NaN on 3 and 0 DF, p-value: NA
This means there are no residual degrees of freedom left, so there is no error estimate on which to base a Tukey test (or any other inferential statistic).
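For contrast, here is a minimal sketch with made-up replicate values (the extra numbers are hypothetical) showing that once each Harvest level has more than one observation, there are residual degrees of freedom and TukeyHSD() runs without the warning:
# Hypothetical second replicate added for each harvest level
Total   <- c(0.2449, 0.2300, 0.1248, 0.1300, 0.0722, 0.0800, 0.1025, 0.1100)
Harvest <- factor(rep(c("b", "c", "d", "e"), each = 2))
fit <- aov(Total ~ Harvest)
summary(fit)                                  # 4 residual degrees of freedom now
TukeyHSD(fit, "Harvest", conf.level = 0.95)   # no NaN warning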
I run the following model in R:
clmm_br <- clmm(Grado_amenaza ~ Life_Form + size_max_cm +
                  leaf_length_mean + petals_length_mean +
                  silicua_length_mean + bloom_length + categ_color +
                  (1 | Genero),
                data = brasic1)
I didn't get any warnings or errors, but when I run summary(clmm_br) I can't get the p-values:
summary(clmm_br)
Cumulative Link Mixed Model fitted with the Laplace approximation
formula: Grado_amenaza ~ Life_Form + size_max_cm + leaf_length_mean +
petals_length_mean + silicua_length_mean + bloom_length +
categ_color + (1 | Genero)
data: brasic1
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 76 -64.18 160.36 1807(1458) 1.50e-03 NaN
Random effects:
Groups Name Variance Std.Dev.
Genero (Intercept) 0.000000008505 0.00009222
Number of groups: Genero 39
Coefficients:
Estimate Std. Error z value Pr(>|z|)
Life_Form[T.G] 2.233338 NA NA NA
Life_Form[T.Hem] 0.577112 NA NA NA
Life_Form[T.Hyd] -22.632916 NA NA NA
Life_Form[T.Th] -1.227512 NA NA NA
size_max_cm 0.006442 NA NA NA
leaf_length_mean 0.008491 NA NA NA
petals_length_mean 0.091623 NA NA NA
silicua_length_mean -0.036001 NA NA NA
bloom_length -0.844697 NA NA NA
categ_color[T.2] -2.420793 NA NA NA
categ_color[T.3] 1.268585 NA NA NA
categ_color[T.4] 1.049953 NA NA NA
Threshold coefficients:
Estimate Std. Error z value
1|3 -1.171 NA NA
3|4 1.266 NA NA
4|5 1.800 NA NA
(4 observations deleted due to missingness)
I tried fitting without the random effect and excluding the rows with NAs, but the result is the same.
The structure of my data:
str(brasic1)
tibble[,13] [80 x 13] (S3: tbl_df/tbl/data.frame)
$ ID : num [1:80] 135 137 142 145 287 295 585 593 646 656 ...
$ Genero : chr [1:80] "Alyssum" "Alyssum" "Alyssum" "Alyssum" ...
$ Cons.stat : chr [1:80] "LC" "VU" "VU" "LC" ...
$ Amenazada : num [1:80] 0 1 1 0 1 0 0 1 0 0 ...
$ Grado_amenaza : Factor w/ 5 levels "1","3","4","5",..: 1 2 2 1 4 1 1 2 1 1 ...
$ Life_Form : chr [1:80] "Th" "Hem" "Hem" "Th" ...
$ size_max_cm : num [1:80] 12 6 7 15 20 27 60 62 50 60 ...
$ leaf_length_mean : num [1:80] 7.5 7 11 14.5 31.5 45 90 65 65 39 ...
$ petals_length_mean : num [1:80] 2.2 3.5 5.5 2.55 6 8 10.5 9.5 9.5 2.9 ...
$ silicua_length_mean: num [1:80] 3.5 4 3.5 4 25 47.5 37.5 41.5 17.5 2.9 ...
$ X2n : num [1:80] 32 NA 16 16 NA NA 20 20 18 14 ...
$ bloom_length : num [1:80] 2 1 2 2 2 2 2 2 11 2 ...
$ categ_color : chr [1:80] "1" "4" "4" "4" ...
For a full answer we really need a reproducible example, but I can point to a few things that raise suspicions.
The fact that you can get estimates, but not standard errors, implies that there is something wrong with the Hessian (the estimate of the curvature of the log-likelihood surface at the maximum likelihood estimate), but there are several distinct (possibly overlapping) possibilities:
- Any time you have a "large" parameter estimate (say, absolute value > 10), as in your example (Life_Form[T.Hyd] = -22.632916), it suggests complete separation, i.e. the predictor associated with that parameter perfectly predicts the response. (You can search for that term, e.g. on CrossValidated.) However, complete separation usually leads to absurdly large standard errors (along with the large parameter estimates) rather than to NAs.
- You may have perfect multicollinearity, i.e. combinations of your predictor variables that are perfectly (jointly) correlated with other such combinations. Some R estimation procedures can detect and deal with this case (typically by dropping one or more predictors), but clmm might not be able to. You should be able to construct your model matrix with X <- model.matrix(your_formula, your_data) (excluding the random effect from the formula) and then use caret::findLinearCombos(X) to explore this issue; see the sketch at the end of this answer.
- More generally, if you want to do reliable inference you may need to cut down the size of your model (though not by stepwise or other forms of model selection); a rule of thumb is that you need 10-20 observations per parameter estimated. You're trying to estimate 12 fixed-effect parameters, plus a few more (the ordinal threshold parameters and the random-effect variance), from 80 observations ...
In addition to dropping the random effect, it may be useful for diagnosis to fit a regular linear model with lm() (which should tell you something about collinearity, since lm() drops aliased parameters) or a binomial model based on some threshold grade value (which might help identify complete separation).
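As a sketch of the collinearity check mentioned above (assuming brasic1 is loaded and the caret package is installed):
# Fixed-effects model matrix, with the random effect left out of the formula;
# rows with NAs in these variables are dropped automatically
X <- model.matrix(~ Life_Form + size_max_cm + leaf_length_mean +
                    petals_length_mean + silicua_length_mean +
                    bloom_length + categ_color,
                  data = brasic1)
# Reports groups of columns that are exact linear combinations of each other,
# and which columns could be removed to break the dependency
caret::findLinearCombos(X)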
I tried to use glm to estimate soccer team strengths.
# data is a data frame (structure at the bottom)
model <- glm(Goals ~ Home + Team + Opponent, family = poisson(link = log), data = data)
but I get the error:
Error in if (any(y < 0)) stop("negative values not allowed for the 'Poisson' family") :
missing value where TRUE/FALSE needed
In addition: Warning message:
In Ops.factor(y, 0) : ‘<’ not meaningful for factors
data:
> data
Team Opponent Goals Home
1 5a51f2589d39c31899cce9d9 5a51f2579d39c31899cce9ce 3 1
2 5a51f2579d39c31899cce9ce 5a51f2589d39c31899cce9d9 0 0
3 5a51f2589d39c31899cce9da 5a51f2579d39c31899cce9cd 3 1
4 5a51f2579d39c31899cce9cd 5a51f2589d39c31899cce9da 0 0
> is.factor(data$Goals)
[1] TRUE
From the "Details" section of the documentation for the glm() function:
A typical predictor has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response.
So you want to make sure your Goals column is numeric:
df <- data.frame(Team     = c("5a51f2589d39c31899cce9d9", "5a51f2579d39c31899cce9ce",
                              "5a51f2589d39c31899cce9da", "5a51f2579d39c31899cce9cd"),
                 Opponent = c("5a51f2579d39c31899cce9ce", "5a51f2589d39c31899cce9d9",
                              "5a51f2579d39c31899cce9cd", "5a51f2589d39c31899cce9da"),
                 Goals    = c(3, 0, 3, 0),
                 Home     = c(1, 0, 1, 0),
                 stringsAsFactors = TRUE)  # so Team/Opponent are factors, matching the str() output below
str(df)
#'data.frame': 4 obs. of 4 variables:
# $ Team : Factor w/ 4 levels "5a51f2579d39c31899cce9cd",..: 3 2 4 1
# $ Opponent: Factor w/ 4 levels "5a51f2579d39c31899cce9cd",..: 2 3 1 4
# $ Goals : num 3 0 3 0
# $ Home : num 1 0 1 0
model <- glm(Goals ~ Home + Team + Opponent, family=poisson(link=log), data=df)
Then here is the output:
> model
Call: glm(formula = Goals ~ Home + Team + Opponent, family = poisson(link = log),
data = df)
Coefficients:
(Intercept) Home Team5a51f2579d39c31899cce9ce
-2.330e+01 2.440e+01 -3.089e-14
Team5a51f2589d39c31899cce9d9 Team5a51f2589d39c31899cce9da Opponent5a51f2579d39c31899cce9ce
-6.725e-15 NA NA
Opponent5a51f2589d39c31899cce9d9 Opponent5a51f2589d39c31899cce9da
NA NA
Degrees of Freedom: 3 Total (i.e. Null); 0 Residual
Null Deviance: 8.318
Residual Deviance: 3.033e-10 AIC: 13.98
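If your existing data frame already has Goals as a factor (as is.factor(data$Goals) shows), be careful: calling as.numeric() directly on a factor returns the internal level codes, not the displayed values, so convert through character first:
# Convert the factor to the numbers it displays, not its level codes
data$Goals <- as.numeric(as.character(data$Goals))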
I have an error in my code that I can't figure out.
I have a data frame "a" with:
row.names GM variance stddev skewness correltomarket DEratio
1 MMM 0.9785122 0.9998918 0.9999459 -1.049053 2.932738 0.07252799
Now, I need to find a linear model for the above dataframe with the following code
riskmodel <- lm(formula=((a$GM)~(a$variance)+(a$skewness)+
(a$correltomarket)+(a$DEratio)),data=a)
When I run this code, I get the following summary for the "riskmodel"
Call:
lm(formula = ((a$GM) ~ (a$variance) + (a$skewness) + (a$correltomarket) +
    (a$DEratio)), data = a)
Residuals:
ALL 1 residuals are 0: no residual degrees of freedom!
Coefficients: (4 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9785 NA NA NA
a$variance NA NA NA NA
a$skewness NA NA NA NA
a$correltomarket NA NA NA NA
a$DEratio NA NA NA NA
Residual standard error: NaN on 0 degrees of freedom
I don't understand why, and I would be really grateful to anyone who helps me with this. I have no idea what's going wrong.
You only have a single observation in your data frame. You can't fit a model with five parameters from a single observation; you would need at least six observations to estimate the parameters and still have a residual degree of freedom for the error variance.
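A sketch with made-up data (the values are hypothetical) showing the same model fitting once there are more observations than parameters; it is also cleaner to name the columns in the formula and let data = a supply them instead of writing a$... inside the formula:
set.seed(1)
a <- data.frame(GM             = rnorm(10),
                variance       = rnorm(10),
                skewness       = rnorm(10),
                correltomarket = rnorm(10),
                DEratio        = rnorm(10))
# 10 observations, 5 parameters -> 5 residual degrees of freedom
riskmodel <- lm(GM ~ variance + skewness + correltomarket + DEratio, data = a)
summary(riskmodel)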
I'm trying to run an uncomplicated regression in R and I'm getting a long list of coefficient values with NAs for the standard errors and t-values. I've never experienced this before.
Result:
summary(model)
Call:
lm(formula = fed$SPX.Index ~ fed$Fed.Treasuries...MM., data = fed)
Residuals:
ALL 311 residuals are 0: no residual degrees of freedom!
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1258.84 NA NA NA
fed$Fed.Treasuries...MM. 1,016,102 0.94 NA NA NA
fed$Fed.Treasuries...MM. 1,030,985 17.72 NA NA NA
fed$Fed.Treasuries...MM. 1,062,061 27.12 NA NA NA
fed$Fed.Treasuries...MM. 917,451 -52.77 NA NA NA
fed$Fed.Treasuries...MM. 949,612 -30.56 NA NA NA
fed$Fed.Treasuries...MM. 967,553 -23.61 NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 310 and 0 DF, p-value: NA
head(fed)
X Fed.Treasuries...MM. Reserve.Repurchases Agency.Debt.Held Treasuries.Maturing.in.5.10.years SPX.Index
1 10/1/2008 476,621 93,063 14,500 93,362 1161.06
2 10/8/2008 476,579 77,349 14,105 93,353 984.94
3 10/15/2008 476,555 107,819 14,105 94,336 907.84
4 10/22/2008 476,512 95,987 14,105 94,327 896.78
5 10/29/2008 476,469 94,655 13,620 94,317 930.09
6 11/5/2008 476,456 96,663 13,235 94,312 952.77
You have commas in the numbers in your CSV file, so R reads those columns as text and turns them into factors. Your model then has as many levels as rows, and so is degenerate.
Illustration. Take this CSV file:
1, "1,234", "2,345,565"
2, "2,345", "3,234,543"
3, "3,234", "3,987,766"
Read it in, then fit the first column (plain numbers) against the third column (comma-separated numbers):
> fed = read.csv("commas.csv",head=FALSE)
> summary(lm(V1~V3, fed))
Call:
lm(formula = V1 ~ V3, data = fed)
Residuals:
ALL 3 residuals are 0: no residual degrees of freedom!
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1 NA NA NA
V3 3,234,543 1 NA NA NA
V3 3,987,766 2 NA NA NA
Residual standard error: NaN on 0 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: NaN
F-statistic: NaN on 2 and 0 DF, p-value: NA
Note this is exactly the output you are getting, just with different column names, so this is almost certainly what is happening.
Fix: convert the column:
> fed$V3 = as.numeric(gsub(",","", fed$V3))
> summary(lm(V1~V3, fed))
Call:
lm(formula = V1 ~ V3, data = fed)
Residuals:
1 2 3
0.02522 -0.05499 0.02977
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.875e+00 1.890e-01 -9.922 0.0639 .
V3 1.215e-06 5.799e-08 20.952 0.0304 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.06742 on 1 degrees of freedom
Multiple R-squared: 0.9977, Adjusted R-squared: 0.9955
F-statistic: 439 on 1 and 1 DF, p-value: 0.03036
Repeat over columns as necessary.
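For example, a sketch (the column names are taken from head(fed) above) that cleans every comma-formatted column in one pass:
# Columns known to contain comma-formatted numbers
comma_cols <- c("Fed.Treasuries...MM.", "Reserve.Repurchases",
                "Agency.Debt.Held", "Treasuries.Maturing.in.5.10.years")
fed[comma_cols] <- lapply(fed[comma_cols],
                          function(x) as.numeric(gsub(",", "", x)))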
I'm doing survival analysis with interval-censored data, using the intcox() function from the intcox package in R, which is based on the coxph() function.
The function returns the output without likelihood ratio test values:
> intcox(surv~sexo,data=dados)
Call:
intcox(formula = surv ~ sexo, data = dados)
coef exp(coef) se(coef) z p
sexojuvenil 2.596 13.4 NA NA NA
sexomacho -0.105 0.9 NA NA NA
Likelihood ratio test=NA on 2 df, p=NA n= 156
I don't know why this is happening... Here is the application of the coxph() function to the same data:
> coxph(Surv(dias_seg,status)~sexo,data=dados)
Call:
coxph(formula = Surv(dias_seg, status) ~ sexo, data = dados)
coef exp(coef) se(coef) z p
sexojuvenil 2.320 10.172 0.630 3.684 0.00023
sexomacho -0.169 0.844 0.252 -0.671 0.50000
Likelihood ratio test=9.28 on 2 df, p=0.00967 n= 156, number of events= 77
str(dados$sexo)
Factor w/ 3 levels "femea","juvenil",..: 3 3 3 3 3 3 3 3 3 3 ...
Can you help me solve this problem?
Thanks in advance.
I was told by Volkmar Henschel (the author of the intcox package) that "the fit with intcox gives an object of class 'coxph' without the standard errors of the regression coefficients".
There are more details in this document:
ftp://ftp.auckland.ac.nz/pub/software/CRAN/doc/vignettes/intcox/intcox.pdf
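Since intcox() deliberately returns no standard errors, one possible workaround (a hedged sketch, not something taken from the linked document) is a nonparametric bootstrap: refit on resampled rows and take the spread of the coefficients.
# Hedged sketch: bootstrap standard errors for the intcox coefficients.
# Assumes the response 'surv' is built from columns of 'dados', so that
# resampling rows of 'dados' resamples the response as well.
set.seed(42)
boot_coefs <- replicate(200, {
  idx <- sample(nrow(dados), replace = TRUE)
  coef(intcox(surv ~ sexo, data = dados[idx, ]))
})
apply(boot_coefs, 1, sd)   # bootstrap standard error for each coefficient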