I would be glad if somebody could help me solve this problem. I have data from a repeated-measures design in which we tested the reaction of birds (time.dep) before and after infection (exper). We also have FL (fuel load, % of lean body mass), fat score, and group (Experimental vs. Control) as explanatory variables. I decided to use LME because the distribution of the residuals doesn’t deviate from normality. But there is a problem with homogeneity of the residual variance: the variances differ significantly between the “before” and “after” groups and between fat levels (Fligner-Killeen test, p=0.038 and p=0.01 respectively).
     ring group fat time.dep     FL  exper
1 XZ13125     E   4     0.36 16.295 before
2 XZ13125     E   3     0.32 12.547  after
3 XZ13126     E   3     0.28  7.721 before
4 XZ13127     C   3     0.32  9.157 before
5 XZ13127     C   3     0.40 -1.902  after
6 XZ13129     C   4     0.40 10.382 before
After selecting the random part of the model, which is a random intercept (~1|ring), I applied the weights argument for both “fat” and “exper”: weights = varComb(varIdent(form = ~1|fat), varIdent(form = ~1|exper)). Now the plot of standardized residuals vs. fitted values looks better, but I still get the violation of homogeneity for these variables (the same values in the Fligner test). What am I doing wrong?
A common trap in lme is that the default is to give raw residuals, i.e. not adjusted for any of the heteroscedasticity (weights) or correlation (correlation) sub-models that may have been used. From ?residuals.lme:
type: an optional character string specifying the type of residuals
to be used. If ‘"response"’, as by default, the “raw”
residuals (observed - fitted) are used; else, if ‘"pearson"’,
the standardized residuals (raw residuals divided by the
corresponding standard errors) are used; else, if
‘"normalized"’, the normalized residuals (standardized
residuals pre-multiplied by the inverse square-root factor of
the estimated error correlation matrix) are used. Partial
matching of arguments is used, so only the first character
needs to be provided.
Thus if you want your residuals to be corrected for heteroscedasticity (as included in the model) you need type="pearson"; if you want them to be corrected for correlation, you need type="normalized".
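As a minimal sketch of the difference: the bird data are not available here, so this uses the Orthodont data that ships with nlme, with a variance structure by Sex chosen purely for illustration. The homogeneity test is run once on the default raw residuals and once on the Pearson residuals.

```r
library(nlme)

# Illustrative stand-in for the bird data: Orthodont ships with nlme.
# Random intercept per subject, separate residual variance per Sex.
fit <- lme(distance ~ age + Sex,
           random  = ~ 1 | Subject,
           weights = varIdent(form = ~ 1 | Sex),
           data    = Orthodont)

r_raw     <- residuals(fit)                    # default: type = "response"
r_pearson <- residuals(fit, type = "pearson")  # scaled by the fitted per-group SD

# Raw residuals still carry the unequal variances the model accounts for;
# Pearson residuals have them divided out.
print(fligner.test(as.numeric(r_raw), Orthodont$Sex))
print(fligner.test(as.numeric(r_pearson), Orthodont$Sex))
```

For the model in the question, the same idea would mean testing residuals(model, type = "pearson") against fat and exper rather than the default residuals.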
I am predicting species' environmental suitability scores. To do so, I make use of the caret package and run classification problems. The literature advocates for the use of the continuous Boyce index (CBI) as a performance measure for model reliability (see 1). I currently tune the models to maximize AUC not the CBI. Nonetheless, I obtain perfect CBI scores (correlation of 1), for both the training and testing data, using ecospat::ecospat.boyce as well as the adamlilith/enmSdm::contBoyce2x available via rdrr.io.
I am not sure whether I am making a mistake and input the wrong vectors. The relevant logistic regression output looks something like this:
head(lg$pred)
      pred      obs   absence  presence rowIndex parameter Resample
1 presence  absence 0.4144518 0.5855482        1      none    Fold1
2 presence presence 0.4402172 0.5597828        2      none    Fold1
3 presence presence 0.3647270 0.6352730        3      none    Fold1
4  absence  absence 0.7154779 0.2845221        6      none    Fold1
5 presence presence 0.1574952 0.8425048        9      none    Fold1
6 presence presence 0.0146231 0.9853769       10      none    Fold1
I call the functions as follows
ecospat::ecospat.boyce
fit = A vector or Raster-Layer containing the predicted suitability values
obs = A vector containing the predicted suitability values or xy-coordinates (if "fit" is a Raster-Layer) of the validation points (presence records)
ecospat::ecospat.boyce(fit=lg$pred$presence,obs=lg$pred$presence[lg$pred$obs=="presence"],PEplot=T)
adamlilith/enmSdm::contBoyce2x
pres = Numeric vector. Predicted values at presence sites.
bg = Numeric vector. Predicted values at absence/background sites.
contBoyce2x(pres=lg$pred$presence[lg$pred$obs=="presence"],bg=lg$pred$presence[lg$pred$obs=="absence"],graph=T)
Have I understood the input arguments correctly? Is it possible to obtain a perfect CBI score without even trying to tune the models to maximize this metric?
(1) Hirzel, A.H., Le Lay, G., Helfer, V., Randin, C., Guisan, A. (2006). Evaluating the ability of the habitat suitability models to predict species presences. Ecological Modelling 199, 142-152.
I've done a good amount of googling and the explanations either don't make any sense or they say just use factors instead of ordinal data. I understand that .L is linear, .Q is quadratic, etc. But I don't know how to actually say what it means. So for example let's say
Primary.L 7.73502 0.984
Primary.Q 6.81674 0.400
Primary.C -4.07055 0.450
Primary^4 1.48845 0.600
where the first column is the variable, second is the estimate, and the third is the p-value. What would I be saying about the variables as they increase in order? Is this basically saying what model I would use so this would be 7.73502x + 6.81674x^2 - 4.07055x^3 is how the model is? Or would it just include quadratic? All of this is so confusing. If anyone could shine a light into how to interpret these .L, .Q, .C, etc., that would be fantastic.
example
> summary(glm(DEPENDENT ~ Year, data = HAVE, family = "binomial"))
Call:
glm(formula = DEPENDENT ~ Year, family = "binomial", data = HAVE)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.3376 -0.2490 -0.2155 -0.1635 3.1802
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.572966 0.028179 -126.798 < 2e-16 ***
Year.L -2.212443 0.150295 -14.721 < 2e-16 ***
Year.Q -0.932844 0.162011 -5.758 8.52e-09 ***
Year.C 0.187344 0.156462 1.197 0.2312
Year^4 -0.595352 0.147113 -4.047 5.19e-05 ***
Year^5 -0.027306 0.135214 -0.202 0.8400
Year^6 -0.023756 0.120969 -0.196 0.8443
Year^7 0.079723 0.111786 0.713 0.4757
Year^8 -0.080749 0.103615 -0.779 0.4358
Year^9 -0.117472 0.098423 -1.194 0.2327
Year^10 -0.134956 0.095098 -1.419 0.1559
Year^11 -0.106700 0.089791 -1.188 0.2347
Year^12 0.102289 0.088613 1.154 0.2484
Year^13 0.125736 0.084283 1.492 0.1357
Year^14 -0.009941 0.084058 -0.118 0.9059
Year^15 -0.173013 0.088781 -1.949 0.0513 .
Year^16 -0.146597 0.090398 -1.622 0.1049
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 18687 on 80083 degrees of freedom
Residual deviance: 18120 on 80067 degrees of freedom
AIC: 18154
Number of Fisher Scoring iterations: 7
That output indicates that your predictor Year is an "ordered factor" meaning R not only understands observations within that variable to be distinct categories or groups (i.e., a factor) but also that the various categories have a natural order to them where one category is considered larger than another.
In this situation, R's default is to fit a series of polynomial functions or contrasts to the levels of the variable. The first is linear (.L), the second is quadratic (.Q), the third is cubic (.C), and so on. R will fit one fewer polynomial functions than the number of available levels. Thus, your output indicates there are 17 distinct years in your data.
You can probably think of those 17 (counting the intercept) predictors in your output as entirely new variables all based on the order of your original variable because R creates them using special values that make all the new predictors orthogonal (i.e., unrelated, linearly independent, or uncorrelated) to each other.
One way to see the values that got used is to use the model.matrix() function on your model object.
model.matrix(glm(DEPENDENT ~ Year, data = HAVE, family = "binomial"))
If you run the above, you will find a bunch of repeated numbers within each of the new variable columns where the changes in repetition correspond to where your original Year predictor switched categories. The specific values themselves hold no real meaning to you because they were chosen/computed by R to make all the contrasts linearly independent of one another.
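A small sketch of those contrast weights, using a hypothetical five-level ordered factor rather than the 17-level Year from the question, so the matrices stay readable:

```r
# Polynomial contrast weights R uses for an ordered factor with 5 levels
cp <- contr.poly(5)
print(round(cp, 3))

# The columns are orthonormal: their cross-product is the identity matrix
print(round(crossprod(cp), 10))

# A toy ordered factor (hypothetical levels) picks up these contrasts by default
yr <- factor(c(2001, 2002, 2003, 2004, 2005), ordered = TRUE)
mm <- model.matrix(~ yr)
print(colnames(mm))
```

The column names of mm follow the same .L, .Q, .C, ^4 pattern as the coefficient names in the summary output.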
Therefore, your model in the R output would be:
logit(p) = -3.57 - 2.21 * Year.L - 0.93 * Year.Q + ... - 0.15 * Year^16
where p is the probability of presence of the characteristic of interest, and the logit transformation is defined as the logged odds where odds = p / (1 - p) and logged odds = ln(odds). Therefore logit(p) = ln(p / (1 - p)).
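The logit definition can be checked numerically in base R. This is only a sketch: p = 0.2 is an arbitrary value chosen for illustration, and -3.57 is the rounded intercept from the output.

```r
p     <- 0.2            # arbitrary probability for illustration
odds  <- p / (1 - p)
logit <- log(odds)

# Base R provides the same transformation (qlogis) and its inverse (plogis)
stopifnot(isTRUE(all.equal(logit, qlogis(p))))
stopifnot(isTRUE(all.equal(plogis(logit), p)))

# Back-transforming a value on the logit scale to a probability
print(plogis(-3.57))
```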
The interpretation of a particular beta test is then generalized to: Which contrasts contribute significantly to explain any differences between levels in your dependent variable? Because your Year.L predictor is significant and negative, this suggests a linear decreasing trend in logit across years, and because your Year.Q predictor is significant and negative, this suggests a decelerating trend is detectable in the pattern of logits across years. Third order polynomials model jerk, and fourth order polynomials model jounce (a.k.a., snap). However, I would stop interpreting at around this order because higher orders quickly become nonsensical to practical folk.
Similarly, to interpret a particular beta estimate is a bit nonsensical to me, but it would be that the odds of switching categories in your outcome at a given level of a particular contrast (e.g., quadratic) as compared to the odds of switching categories in your outcome at the given level of that contrast (e.g., quadratic) less one unit is equal to the odds ratio had by exponentiating the beta estimate. For the quadratic contrast in your example, the odds ratio would be exp(-0.9328) = 0.3935, but I say it's a bit nonsensical because the units have little practical meaning as they were chosen by R to make the predictors linearly independent from one another. Thus I prefer focusing on the interpretation of a given contrast test as opposed to the coefficient in this circumstance.
For further reading, here is a webpage at UCLA's wonderful IDRE that discusses how to interpret odds ratios in logistic regression, and here is a crazy cool but intense stack exchange answer that walks through how R chooses the polynomial contrast weights.
Previously I have had no issues when using lsmeans to identify significant differences between groups while controlling for other factors using lme4 models. However, with the following dataset looking at fluorescence, lsmeans produces identical p-values regardless of the other factor levels.
A subset of the data used in this example can be found here:
https://drive.google.com/file/d/0B3-esLisG8EbTzA3cjVpRGtjREU/view?usp=sharing
Data
Response(s): here 1/0 presence/absence (but also average pixel intensity and cbind percentage cover)
Fixed factor 1: heat treatment - 2 levels
Fixed factor 2: competition treatment - 2 levels
Fixed factor 3: time treatment - 2 levels
Random factor: none
Model creation
library(lme4)
model<-glm(presence ~ heat.treatment + competition.treatment + time.post.mating.hrs, binomial(link= "logit"), data=gfptest)
Originally interaction terms were included, but they were non-significant based on AIC testing. Using drop1 for significance testing of fixed-factor removal shows that heat is important:
drop1(model, test= "Chi")
# presence ~ heat.treatment + competition.treatment + time.post.mating.hrs
# Df Deviance AIC LRT Pr(>Chi)
#<none> 30.589 38.589
#heat.treatment 1 39.114 45.114 8.5251 0.003503 **
#competition.treatment 1 30.876 36.876 0.2868 0.592297
#time.post.mating.hrs 1 32.410 38.410 1.8206 0.177237
I would like to test for the difference between control and heat treatments while controlling for the competition treatment and time treatment, e.g. is presence significantly different between controls and heats at timepoint 0.5 hours with no competition, is presence significantly different between controls and heats at timepoint 24 hours with no competition, etc. I've tried lsmeans functions (multcomp yields similar results):
lsmeans(model, pairwise~heat.treatment+competition.treatment+time.post.mating.hrs, adjust="tukey")
and more explicitly
heat.lsm <- lsmeans(model, "heat.treatment", by = "competition.treatment", at = list(time.post.mating.hrs = "0.5"))  # separate name so the glm object is not overwritten
heat.sum <- summary(heat.lsm, infer = c(TRUE, TRUE), level = 0.90, adjust = "bon", by = "competition.treatment")
heat.sum
pairs(heat.lsm)
However, both give identical p-values within each group combination, which does not seem accurate when looking at boxplots and doing pairwise Mann-Whitney U tests on data ranks.
$contrasts
contrast estimate SE df z.ratio p.value
control,single,0.5 - heat,single,0.5 18.3718560 2224.3464134 NA 0.008 1.0000
control,competition,0.5 - heat,competition,0.5 18.3718560 2224.3464134 NA 0.008 1.0000
control,single,24 - heat,single,24 18.3718560 2224.3464134 NA 0.008 1.0000
control,competition,24 - heat,competition,24 18.3718560 2224.3464134 NA 0.008 1.0000
I have tried exploring the dataframe to eliminate the cause of identical p-values. The issue is still apparent with reducing the number of factors to two and using a different response variable/error distribution.
Any help with resolving the lsmeans (or similar package) issue would be appreciated. As a secondary option, any advice on whether it's acceptable to fit Poisson/binomial glm()s and then follow up with post-hoc t-tests/Mann-Whitney tests would also be welcome.
It seems very odd that you would note that the P values are all the same and apparently did not notice that the estimated differences and standard errors are all the same. Most people look at those first. (If you didn't, I recommend it highly; we should talk about effects and such, the P values are just accessories.)
Anyway, the explanation is that you fitted an additive model -- one with no interactions. Such a model specifies that the effect of one factor is exactly the same, regardless of the levels of any other factor. And that is exactly what you are seeing.
In short, this has nothing to do with lsmeans() and everything to do with the model you fitted.
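A minimal sketch of why this happens, on simulated data rather than the linked dataset (the factor names heat, comp and time and the cell size of 20 are assumptions made for illustration). The control vs. heat difference on the logit scale is computed in each competition-by-time cell, first for an additive model and then for one with interactions.

```r
set.seed(1)

# Simulated stand-in data: 2 x 2 x 2 design, 20 binary observations per cell
cells <- expand.grid(heat = c("control", "heat"),
                     comp = c("single", "competition"),
                     time = c("0.5", "24"))
d <- cells[rep(seq_len(nrow(cells)), each = 20), ]
d$y <- rbinom(nrow(d), size = 1, prob = 0.5)

fit_add <- glm(y ~ heat + comp + time, family = binomial, data = d)  # additive
fit_int <- glm(y ~ heat * comp * time, family = binomial, data = d)  # interactions

# control - heat difference on the logit scale within each comp x time cell
heat_diff <- function(fit) {
  eta <- predict(fit, newdata = cells, type = "link")
  eta[cells$heat == "control"] - eta[cells$heat == "heat"]
}

diff_add <- heat_diff(fit_add)  # the same number in every cell
diff_int <- heat_diff(fit_int)  # free to differ between cells
print(diff_add)
print(diff_int)
```

In the additive fit every cell reports the same difference (minus the heat coefficient), which is the pattern in the contrast table above. Only a model containing the relevant interaction terms can produce cell-specific comparisons.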
I'm fitting a multiple linear regression model using lm(). Y is the response variable (e.g. return of interests) and the others are explanatory variables (100+ cases, 30+ variables).
Certain variables are considered key variables (concerning investment). When I ran the lm() function, R returned a model with an adjusted R-squared of 97%, but some of the key variables are not significant predictors.
Is there a way to do a regression by keeping all of the key variables in the model (as significant predictors)? It doesn't matter if the adjusted R square decreases.
If the regression doesn't work, is there other methodology?
thank you!
==========================
the data set is uploaded
https://www.dropbox.com/s/gh61obgn2jr043y/df.csv
==========================
additional questions:
What if some variables have an impact that carries over from the previous period to the current period?
Example: someone takes a pill in the morning at breakfast, and the effect of the pill might last until after lunch (when he/she takes the 2nd pill).
I suppose I need to consider a data transformation.
* My first choice is to add a carry-over term: obs.2_trans = obs.2 + c-o rate * obs.1
* Maybe I also need to consider the decay of the pill effect itself, so an S-curve or an exponential transformation is also necessary.
Taking variable main1 as an example, I can use trial and error to find an ideal c-o rate and S-curve parameter, starting from 0.5 and testing in steps of 0.05, up to 1 or down to 0, until I get the best model score, say the lowest AIC or the highest R-squared.
This is already a huge number of combinations to test.
If I need to test more than 3 variables at the same time, how could I manage that in R?
Thank you!
First, a note on "significance". For each variable included in a model, the linear modeling packages report a p-value: the probability of seeing an estimate at least this far from zero if the true coefficient were zero. The smaller the p-value, the "more significant" the coefficient. So, while it is quite reasonable to talk about one variable being "more significant" than another, there is no absolute standard for asserting "significant" vs. "not significant". In most scientific research, the cutoff is p < 0.05. But this is completely arbitrary, and there are many exceptions. Recall that CERN was unwilling to assert the existence of the Higgs boson until they had collected enough data to demonstrate its effect at 5-sigma. This corresponds roughly to p < 3 × 10^-7. At the other extreme, many social science studies assert significance at p < 0.2 (because of the higher inherent variability and usually small number of samples). So excluding a variable from a model because it is "not significant" really has no meaning. On the other hand, you would be hard pressed to justify including a variable with high p while excluding another variable with lower p.
Second, if your variables are highly correlated (which they are in your case), then it is quite common that removing one variable from a model changes all the p-values greatly. A retained variable that had a high p-value (less significant), might suddenly have low p-value (more significant), just because you removed a completely different variable from the model. Consequently, trying to optimize a fit manually is usually a bad idea.
Fortunately, there are many algorithms that do this for you. One popular approach starts with a model that has all the variables. At each step, the least significant variable is removed and the resulting model is compared to the model at the previous step. If removing this variable significantly degrades the model, based on some metric, the process stops. A commonly used metric is the Akaike information criterion (AIC), and in R we can optimize a model based on the AIC criterion using stepAIC(...) in the MASS package.
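If certain variables must stay in the model, stepAIC() accepts a scope argument whose lower component lists terms that are never dropped. A minimal sketch, assuming the built-in mtcars data as a stand-in for the questioner's file and treating wt and hp as the hypothetical "key" variables:

```r
library(MASS)

# Illustrative data: mtcars ships with R; wt and hp play the "key" variables
full <- lm(mpg ~ ., data = mtcars)

# Backward elimination by AIC, but never drop the terms listed in `lower`
kept <- stepAIC(full,
                scope     = list(lower = ~ wt + hp),
                direction = "backward",
                trace     = 0)

print(formula(kept))
```

Note that this only forces the key variables to be retained; it does not make their coefficients significant.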
Third, the validity of regression models depends on certain assumptions, especially these two: the error variance is constant (does not depend on y), and the distribution of error is approximately normal. If these assumptions are not met, the p-values are completely meaningless!! Once we have fitted a model we can check these assumptions using a residual plot and a Q-Q plot. It is essential that you do this for any candidate model!
Finally, the presence of outliers frequently distorts the model significantly (almost by definition!). This problem is amplified if your variables are highly correlated. So in your case it is very important to look for outliers, and see what happens when you remove them.
The code below rolls this all up.
library(MASS)
url <- "https://dl.dropboxusercontent.com/s/gh61obgn2jr043y/df.csv?dl=1&token_hash=AAGy0mFtfBEnXwRctgPHsLIaqk5temyrVx_Kd97cjZjf8w&expiry=1399567161"
df <- read.csv(url)
initial.fit <- lm(Y~.,df[,2:ncol(df)]) # fit with all variables (excluding PeriodID)
final.fit <- stepAIC(initial.fit) # best fit based on AIC
par(mfrow=c(2,2))
plot(initial.fit) # diagnostic plots for base model
plot(final.fit) # same for best model
summary(final.fit)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 11.38360 18.25028 0.624 0.53452
# Main1 911.38514 125.97018 7.235 2.24e-10 ***
# Main3 0.04424 0.02858 1.548 0.12547
# Main5 4.99797 1.94408 2.571 0.01195 *
# Main6 0.24500 0.10882 2.251 0.02703 *
# Sec1 150.21703 34.02206 4.415 3.05e-05 ***
# Third2 -0.11775 0.01700 -6.926 8.92e-10 ***
# Third3 -0.04718 0.01670 -2.826 0.00593 **
# ... (many other variables included)
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 22.76 on 82 degrees of freedom
# Multiple R-squared: 0.9824, Adjusted R-squared: 0.9779
# F-statistic: 218 on 21 and 82 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(initial.fit)
title("Base Model",outer=T,line=-2)
plot(final.fit)
title("Best Model (AIC)",outer=T,line=-2)
So you can see from this that the "best model", based on the AIC metric, does in fact include Main 1, 3, 5, and 6, but not Main 2 and 4. The residuals plot shows no dependence on y (which is good), and the Q-Q plot demonstrates approximate normality of the residuals (also good). On the other hand, the Leverage plot shows a couple of points (rows 33 and 85) with exceptionally high leverage, and the Q-Q plot shows these same points and row 47 as having residuals not really consistent with a normal distribution. So we can re-run the fits excluding these rows as follows.
initial.fit <- lm(Y~.,df[c(-33,-47,-85),2:ncol(df)])
final.fit <- stepAIC(initial.fit,trace=0)
summary(final.fit)
# ...
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 27.11832 20.28556 1.337 0.185320
# Main1 1028.99836 125.25579 8.215 4.65e-12 ***
# Main2 2.04805 1.11804 1.832 0.070949 .
# Main3 0.03849 0.02615 1.472 0.145165
# Main4 -1.87427 0.94597 -1.981 0.051222 .
# Main5 3.54803 1.99372 1.780 0.079192 .
# Main6 0.20462 0.10360 1.975 0.051938 .
# Sec1 129.62384 35.11290 3.692 0.000420 ***
# Third2 -0.11289 0.01716 -6.579 5.66e-09 ***
# Third3 -0.02909 0.01623 -1.793 0.077060 .
# ... (many other variables included)
So excluding these rows results in a fit that has all the "Main" variables with p < 0.2, and all except Main 3 at p < 0.1 (90%). I'd want to look at these three rows and see if there is a legitimate reason to exclude them.
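One way to inspect such rows numerically rather than by eye is sketched below. It uses a small model on the built-in mtcars data as a stand-in for final.fit, and the 2 * p / n leverage cutoff is only a common rule of thumb, not something derived from the data above.

```r
# Illustrative fit on mtcars (ships with R), standing in for final.fit
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

h  <- hatvalues(fit)       # leverage of each row
cd <- cooks.distance(fit)  # influence of each row on the fitted coefficients

# A common rule of thumb flags leverage above 2 * p / n
p <- length(coef(fit))
n <- nrow(mtcars)
flagged <- names(h)[h > 2 * p / n]

# Inspect the flagged rows alongside their diagnostics before deciding anything
print(cbind(mtcars[flagged, c("mpg", "wt", "hp", "qsec")],
            leverage = h[flagged], cooks = cd[flagged]))
```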
Finally, just because you have a model that fits your existing data well, does not mean that it will perform well as a predictive model. In particular, if you are trying to make predictions outside of the "model space" (equivalent to extrapolation), then your predictive power is likely to be poor.
Significance is determined by the relationships in your data, not by "I want them to be significant".
If the data says they are insignificant, then they are insignificant.
You are going to have a hard time getting any significance with 30 variables and only 100 observations. With only 100+ observations, you should be using only a few variables. With 30 variables, you'd need thousands of observations to get any significance.
Maybe start with the variables you think should be significant, and see what happens.
I have the x and y values below. As you can see, x is mostly negative: basically I only have the left side of the PDF of my observed data.
I have to fit it with a Student's t distribution and find the degrees of freedom and the scale parameter.
The problem is that the estimated distribution will have a very small variance (i.e. a small scale parameter). So when I use the method below to fit the distribution, nls fails to converge no matter what initial values I set.
I have used an extra parameter c in the code below because I scale the distribution using dt(x/a,df). Therefore, in order to conserve the probability, I unavoidably have to multiply the output by a constant. I believe this extra parameter leads to poor convergence, but I have no idea how to fit the distribution in a better way.
I have looked for distribution-fitting packages, but those packages require a complete distribution, while I only have the left side of it.
x y
1 -0.0050 0.000000
2 -0.0045 26.723019
3 -0.0040 28.557704
4 -0.0035 41.085068
5 -0.0030 66.258445
6 -0.0025 81.129807
7 -0.0020 83.751611
8 -0.0015 130.378353
9 -0.0010 157.806018
10 -0.0005 201.505657
11 0.0000 949.650354
12 0.0005 193.721270
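x <- seq(-0.005, 0.0005, by = 0.0005)  # the 12 x values listed above
y <- c(0, 26.723019, 28.557704, 41.085068, 66.258445, 81.129807,
       83.751611, 130.378353, 157.806018, 201.505657, 949.650354, 193.721270)
dat <- data.frame(x = x, y = y)
res <- nls(y ~ dt(x / a, df) * c, data = dat,
           start = list(a = 0.000201, df = 0.9, c = 2104), trace = TRUE)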